pandas plot garmin
20190714
Convert a Garmin .FIT file and plot the heartrate of your run
Use gpsbabel to turn your FIT file into CSV:
gpsbabel -t -i garmin_fit -f 97D54119.FIT -o unicsv -F 97D54119.csv
Pandas imports:
import pandas as pd
import math
import matplotlib.pyplot as plt
pd.options.display.width=150
Read the CSV file:
df=pd.read_csv('97D54119.csv',sep=',',skip_blank_lines=False)
Show some points:
df.head(500).tail(10)
No Latitude Longitude Altitude Speed Heartrate Cadence Date Time
490 491 50.855181 4.826737 78.2 2.79 144 78.0 2019/07/13 06:44:22
491 492 50.855136 4.826739 77.6 2.79 147 78.0 2019/07/13 06:44:24
492 493 50.854962 4.826829 76.2 2.77 148 77.0 2019/07/13 06:44:32
493 494 50.854778 4.826951 77.4 2.77 146 78.0 2019/07/13 06:44:41
494 495 50.854631 4.827062 78.0 2.71 143 78.0 2019/07/13 06:44:49
495 496 50.854531 4.827174 79.2 2.70 146 77.0 2019/07/13 06:44:54
496 497 50.854472 4.827249 79.2 2.73 149 77.0 2019/07/13 06:44:57
497 498 50.854315 4.827418 79.8 2.74 149 76.0 2019/07/13 06:45:05
498 499 50.854146 4.827516 77.4 2.67 147 76.0 2019/07/13 06:45:14
499 500 50.853985 4.827430 79.0 2.59 144 75.0 2019/07/13 06:45:22
Function to compute the distance (approximately):
# function to approximately calculate the distance between 2 points
# from: http://www.movable-type.co.uk/scripts/latlong.html
def rough_distance(lat1, lon1, lat2, lon2):
    lat1 = lat1 * math.pi / 180.0
    lon1 = lon1 * math.pi / 180.0
    lat2 = lat2 * math.pi / 180.0
    lon2 = lon2 * math.pi / 180.0
    r = 6371.0  # km
    x = (lon2 - lon1) * math.cos((lat1+lat2)/2)
    y = (lat2 - lat1)
    d = math.sqrt(x*x+y*y) * r
    return d
Compute the cumulative distance:
ds=[]
(d,priorlat,priorlon)=(0.0, 0.0, 0.0)
for t in df[['Latitude','Longitude']].itertuples():
    if len(ds)>0:
        d+=rough_distance(t.Latitude,t.Longitude, priorlat, priorlon)
    ds.append(d)
    (priorlat,priorlon)=(t.Latitude,t.Longitude)
df['CumulativeDist']=ds
Let's plot!
df.plot(kind='line',x='CumulativeDist',y='Heartrate',color='red')
plt.show()
Or multiple columns:
plt.plot( df.CumulativeDist, df.Heartrate, color='red')
plt.plot( df.CumulativeDist, df.Altitude, color='blue')
plt.show()
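If you'd rather give each series its own y-axis (handy when the ranges differ a lot), pandas can put one of them on a secondary axis. A small sketch, assuming the same df as above:
# heartrate on the left axis, altitude on a secondary right axis
ax = df.plot(x='CumulativeDist', y=['Heartrate', 'Altitude'],
             secondary_y=['Altitude'], color=['red', 'blue'])
plt.show()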
dataframe pandas
20190202
Turn a dataframe into an array
eg. dataframe cn
cn
asciiname population elevation
128677 Crossett 5507 58
7990 Santa Maria da Vitoria 23488 438
25484 Shanling 0 628
95882 Colonia Santa Teresa 36845 2286
38943 Blomberg 1498 4
7409 Missao Velha 13106 364
36937 Goerzig 1295 81
Turn it into an array
cn.iloc[range(len(cn))].values
array([['Yuhu', 0, 15],
['Traventhal', 551, 42],
['Velabisht', 0, 60],
['Almorox', 2319, 539],
['Abuyog', 15632, 6],
['Zhangshan', 0, 132],
['Llica de Vall', 0, 136],
['Capellania', 2252, 31],
['Mezocsat', 6519, 91],
['Vars', 1634, 52]], dtype=object)
Sidenote: cn was pulled from the city data: cn=df.sample(7)[['asciiname','population','elevation']]. (The array shown above came out of a different sample run, hence the different rows.)
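Note that the iloc gymnastics are not needed; taking .values on the dataframe gives the same array, and recent pandas also offers .to_numpy(). A minimal sketch:
# equivalent, shorter ways to get the underlying numpy array
cn.values
cn.to_numpy()   # available in pandas >= 0.24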
pandas dataframe
20161004
Turn a pandas dataframe into a dictionary
eg. create a mapping of a 3-letter code to a country name
BEL -> Belgium
CHN -> China
FRA -> France
..
Code:
df=pd.io.parsers.read_table(
'/u01/data/20150215_country_code_iso/country_codes.txt',
sep='|')
c3d=df.set_index('c3')['country'].to_dict()
Result:
c3d['AUS']
'Australia'
c3d['GBR']
'United Kingdom'
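Typical use of such a dict: translate a column of codes in another dataframe with .map. A sketch (other_df is a hypothetical dataframe that has a 'c3' column):
# hypothetical example: turn a column of 3-letter codes into country names
other_df['country_name'] = other_df['c3'].map(c3d)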
eg. read a csv file that has nasty quotes, and save it as tab-separated.
import pandas as pd
import csv
colnames= ["userid", "movieid", "tag", "timestamp"]
df=pd.io.parsers.read_table("tags.csv",
sep=",", header=0, names= colnames,
quoting=csv.QUOTE_ALL)
Write:
df.to_csv('tags.tsv', index=False, sep='\t')
pandas aggregation groupby
20160606
Dataframe aggregation fun
Load the city dataframe into dataframe df.
Summary statistic of 1 column
df.population.describe()
count 2.261100e+04
mean 1.113210e+05
std 4.337739e+05
min 0.000000e+00
25% 2.189950e+04
50% 3.545000e+04
75% 7.402450e+04
max 1.460851e+07
Summary statistic per group
Load the city dataframe into df, then:
t1=df[['country','population']].groupby(['country'])
t2=t1.agg( ['min','mean','max','count'])
t2.sort_values(by=[ ('population','count') ],ascending=False).head(20)
Output:
population
min mean max count
country
US 15002 62943.294138 8175133 2900
IN 15007 109181.708924 12691836 2398
BR 0 104364.320502 10021295 1195
DE 0 57970.979716 3426354 986
RU 15048 101571.065195 10381222 951
CN 15183 357967.030457 14608512 788
JP 15584 136453.906915 8336599 752
IT 895 49887.442136 2563241 674
GB 15024 81065.611200 7556900 625
FR 15009 44418.920455 2138551 616
ES 15006 65588.432282 3255944 539
MX 15074 153156.632735 12294193 501
PH 15066 100750.534884 10444527 430
TR 15058 142080.305263 11174257 380
ID 17504 170359.848901 8540121 364
PL 15002 64935.379421 1702139 311
PK 15048 160409.378641 11624219 309
NL 15071 53064.727626 777725 257
UA 15012 103468.816000 2514227 250
NG 15087 205090.336207 9000000 232
Note on selecting a multilevel column
Eg. select 'min' via tuple ('population','min').
t2[ t2[('population','min')]>50000 ]
population
min mean max count
country
BB 98511 9.851100e+04 98511 1
CW 125000 1.250000e+05 125000 1
HK 288728 3.107000e+06 7012738 3
MO 520400 5.204000e+05 520400 1
MR 72337 3.668685e+05 661400 2
MV 103693 1.036930e+05 103693 1
SB 56298 5.629800e+04 56298 1
SG 3547809 3.547809e+06 3547809 1
ST 53300 5.330000e+04 53300 1
TL 150000 1.500000e+05 150000 1
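If the tuple syntax gets unwieldy, an alternative (a sketch, using t2 from above) is to flatten the two-level columns into plain names first:
# flatten ('population','min') etc. into 'population_min' and filter as usual
t3 = t2.copy()
t3.columns = ['_'.join(col) for col in t3.columns]
t3[t3['population_min'] > 50000]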
Turn a dataframe into sql statements
The easiest way is to go via sqlite!
eg. the two dataframes udf and tdf.
import sqlite3
con=sqlite3.connect('txdb.sqlite')
udf.to_sql(name='t_user', con=con, index=False)
tdf.to_sql(name='t_transaction', con=con, index=False)
con.close()
Then on the command line:
sqlite3 txdb.sqlite .dump > create.sql
This is the created create.sql script:
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE "t_user" (
"uid" INTEGER,
"name" TEXT
);
INSERT INTO "t_user" VALUES(9000,'Gerd Abrahamsson');
INSERT INTO "t_user" VALUES(9001,'Hanna Andersson');
INSERT INTO "t_user" VALUES(9002,'August Bergsten');
INSERT INTO "t_user" VALUES(9003,'Arvid Bohlin');
INSERT INTO "t_user" VALUES(9004,'Edvard Marklund');
INSERT INTO "t_user" VALUES(9005,'Ragnhild Brännström');
INSERT INTO "t_user" VALUES(9006,'Börje Wallin');
INSERT INTO "t_user" VALUES(9007,'Otto Byström');
INSERT INTO "t_user" VALUES(9008,'Elise Dahlström');
CREATE TABLE "t_transaction" (
"xid" INTEGER,
"uid" INTEGER,
"amount" INTEGER,
"date" TEXT
);
INSERT INTO "t_transaction" VALUES(5000,9008,498,'2016-02-21 06:28:49');
INSERT INTO "t_transaction" VALUES(5001,9003,268,'2016-01-17 13:37:38');
INSERT INTO "t_transaction" VALUES(5002,9003,621,'2016-02-24 15:36:53');
INSERT INTO "t_transaction" VALUES(5003,9007,-401,'2016-01-14 16:43:27');
INSERT INTO "t_transaction" VALUES(5004,9004,720,'2016-05-14 16:29:54');
INSERT INTO "t_transaction" VALUES(5005,9007,-492,'2016-02-24 23:58:57');
INSERT INTO "t_transaction" VALUES(5006,9002,-153,'2016-02-18 17:58:33');
INSERT INTO "t_transaction" VALUES(5007,9008,272,'2016-05-26 12:00:00');
INSERT INTO "t_transaction" VALUES(5008,9005,-250,'2016-02-24 23:14:52');
INSERT INTO "t_transaction" VALUES(5009,9008,82,'2016-04-20 18:33:25');
INSERT INTO "t_transaction" VALUES(5010,9006,549,'2016-02-16 14:37:25');
INSERT INTO "t_transaction" VALUES(5011,9008,-571,'2016-02-28 13:05:33');
INSERT INTO "t_transaction" VALUES(5012,9008,814,'2016-03-20 13:29:11');
INSERT INTO "t_transaction" VALUES(5013,9005,-114,'2016-02-06 14:55:10');
INSERT INTO "t_transaction" VALUES(5014,9005,819,'2016-01-18 10:50:20');
INSERT INTO "t_transaction" VALUES(5015,9001,-404,'2016-02-20 22:08:23');
INSERT INTO "t_transaction" VALUES(5016,9000,-95,'2016-05-09 10:26:05');
INSERT INTO "t_transaction" VALUES(5017,9003,428,'2016-03-27 15:30:47');
INSERT INTO "t_transaction" VALUES(5018,9002,-549,'2016-04-15 21:44:49');
INSERT INTO "t_transaction" VALUES(5019,9001,-462,'2016-03-09 20:32:35');
INSERT INTO "t_transaction" VALUES(5020,9004,-339,'2016-05-03 17:11:21');
COMMIT;
The script doesn't create any indexes (to_sql was called with index=False), so here are the statements:
CREATE INDEX "ix_t_user_uid" ON "t_user" ("uid");
CREATE INDEX "ix_t_transaction_xid" ON "t_transaction" ("xid");
Or better: create primary keys on those tables!
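One way to get real primary keys (a sketch, not what was done above): create the tables yourself first, then let to_sql append into them:
import sqlite3

con = sqlite3.connect('txdb2.sqlite')   # a fresh database file
# create the tables with primary keys before loading the dataframes
con.execute('CREATE TABLE t_user (uid INTEGER PRIMARY KEY, name TEXT)')
con.execute('''CREATE TABLE t_transaction (
                   xid    INTEGER PRIMARY KEY,
                   uid    INTEGER REFERENCES t_user(uid),
                   amount INTEGER,
                   "date" TEXT)''')
# if_exists='append' loads the rows into the pre-created tables
udf.to_sql(name='t_user', con=con, index=False, if_exists='append')
tdf.to_sql(name='t_transaction', con=con, index=False, if_exists='append')
con.close()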
Join two dataframes, sql style
You have a number of users, and a number of transactions against those users. Join these 2 dataframes.
import pandas as pd
User dataframe
ids= [9000, 9001, 9002, 9003, 9004, 9005, 9006, 9007, 9008]
nms=[u'Gerd Abrahamsson', u'Hanna Andersson', u'August Bergsten',
u'Arvid Bohlin', u'Edvard Marklund', u'Ragnhild Br\xe4nnstr\xf6m',
u'B\xf6rje Wallin', u'Otto Bystr\xf6m',u'Elise Dahlstr\xf6m']
udf=pd.DataFrame(ids, columns=['uid'])
udf['name']=nms
Content of udf:
uid name
0 9000 Gerd Abrahamsson
1 9001 Hanna Andersson
2 9002 August Bergsten
3 9003 Arvid Bohlin
4 9004 Edvard Marklund
5 9005 Ragnhild Brännström
6 9006 Börje Wallin
7 9007 Otto Byström
8 9008 Elise Dahlström
Transaction dataframe
tids= [5000, 5001, 5002, 5003, 5004, 5005, 5006, 5007, 5008, 5009, 5010, 5011, 5012,
5013, 5014, 5015, 5016, 5017, 5018, 5019, 5020]
uids= [9008, 9003, 9003, 9007, 9004, 9007, 9002, 9008, 9005, 9008, 9006, 9008, 9008,
9005, 9005, 9001, 9000, 9003, 9002, 9001, 9004]
tamt= [498, 268, 621, -401, 720, -492, -153, 272, -250, 82, 549, -571, 814, -114,
819, -404, -95, 428, -549, -462, -339]
tdt= ['2016-02-21 06:28:49', '2016-01-17 13:37:38', '2016-02-24 15:36:53',
'2016-01-14 16:43:27', '2016-05-14 16:29:54', '2016-02-24 23:58:57',
'2016-02-18 17:58:33', '2016-05-26 12:00:00', '2016-02-24 23:14:52',
'2016-04-20 18:33:25', '2016-02-16 14:37:25', '2016-02-28 13:05:33',
'2016-03-20 13:29:11', '2016-02-06 14:55:10', '2016-01-18 10:50:20',
'2016-02-20 22:08:23', '2016-05-09 10:26:05', '2016-03-27 15:30:47',
'2016-04-15 21:44:49', '2016-03-09 20:32:35', '2016-05-03 17:11:21']
tdf=pd.DataFrame(tids, columns=['xid'])
tdf['uid']=uids
tdf['amount']=tamt
tdf['date']=tdt
Content of tdf:
xid uid amount date
0 5000 9008 498 2016-02-21 06:28:49
1 5001 9003 268 2016-01-17 13:37:38
2 5002 9003 621 2016-02-24 15:36:53
3 5003 9007 -401 2016-01-14 16:43:27
4 5004 9004 720 2016-05-14 16:29:54
5 5005 9007 -492 2016-02-24 23:58:57
6 5006 9002 -153 2016-02-18 17:58:33
7 5007 9008 272 2016-05-26 12:00:00
8 5008 9005 -250 2016-02-24 23:14:52
9 5009 9008 82 2016-04-20 18:33:25
10 5010 9006 549 2016-02-16 14:37:25
11 5011 9008 -571 2016-02-28 13:05:33
12 5012 9008 814 2016-03-20 13:29:11
13 5013 9005 -114 2016-02-06 14:55:10
14 5014 9005 819 2016-01-18 10:50:20
15 5015 9001 -404 2016-02-20 22:08:23
16 5016 9000 -95 2016-05-09 10:26:05
17 5017 9003 428 2016-03-27 15:30:47
18 5018 9002 -549 2016-04-15 21:44:49
19 5019 9001 -462 2016-03-09 20:32:35
20 5020 9004 -339 2016-05-03 17:11:21
Join sql-style: pd.merge
pd.merge( tdf, udf, how='inner', left_on='uid', right_on='uid')
xid uid amount date name
0 5000 9008 498 2016-02-21 06:28:49 Elise Dahlström
1 5007 9008 272 2016-05-26 12:00:00 Elise Dahlström
2 5009 9008 82 2016-04-20 18:33:25 Elise Dahlström
3 5011 9008 -571 2016-02-28 13:05:33 Elise Dahlström
4 5012 9008 814 2016-03-20 13:29:11 Elise Dahlström
5 5001 9003 268 2016-01-17 13:37:38 Arvid Bohlin
6 5002 9003 621 2016-02-24 15:36:53 Arvid Bohlin
7 5017 9003 428 2016-03-27 15:30:47 Arvid Bohlin
8 5003 9007 -401 2016-01-14 16:43:27 Otto Byström
9 5005 9007 -492 2016-02-24 23:58:57 Otto Byström
10 5004 9004 720 2016-05-14 16:29:54 Edvard Marklund
11 5020 9004 -339 2016-05-03 17:11:21 Edvard Marklund
12 5006 9002 -153 2016-02-18 17:58:33 August Bergsten
13 5018 9002 -549 2016-04-15 21:44:49 August Bergsten
14 5008 9005 -250 2016-02-24 23:14:52 Ragnhild Brännström
15 5013 9005 -114 2016-02-06 14:55:10 Ragnhild Brännström
16 5014 9005 819 2016-01-18 10:50:20 Ragnhild Brännström
17 5010 9006 549 2016-02-16 14:37:25 Börje Wallin
18 5015 9001 -404 2016-02-20 22:08:23 Hanna Andersson
19 5019 9001 -462 2016-03-09 20:32:35 Hanna Andersson
20 5016 9000 -95 2016-05-09 10:26:05 Gerd Abrahamsson
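Sidenote on the how parameter: with how='left' and the users on the left, every user is kept even if it has no transactions (not the case in this sample data, where each user has at least one). A sketch:
# keep every user; unmatched transaction columns would show up as NaN
pd.merge(udf, tdf, how='left', on='uid')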
Sidenote: fake data creation
This is the way the above fake data was created:
import random
from faker import Factory
fake = Factory.create('sv_SE')

ids=[]
nms=[]
for i in range(0,9):
    ids.append(9000+i)
    nms.append(fake.name())
    print "%d\t%s" % ( ids[i],nms[i])

tids=[]
uids=[]
tamt=[]
tdt=[]
sign=[-1,1]
for i in range(0,21):
    tids.append(5000+i)
    tamt.append(sign[random.randint(0,1)]*random.randint(80,900))
    uids.append(ids[random.randint(0,len(ids)-1)])
    tdt.append(str(fake.date_time_this_year()))
    print "%d\t%d\t%d\t%s" % ( tids[i], tamt[i], uids[i], tdt[i])
Read data from a zipfile into a dataframe
import pandas as pd
import zipfile
z = zipfile.ZipFile("lending-club-data.csv.zip")
df=pd.io.parsers.read_table(z.open("lending-club-data.csv"), sep=",")
z.close()
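Sidenote: recent pandas versions can open the zip directly, as long as the archive contains a single csv file (compression is inferred from the extension). A sketch:
# read_csv opens the zip itself, no zipfile module needed
df = pd.read_csv("lending-club-data.csv.zip")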
pandas distance track gps
20160420
Calculate the cumulative distance of gps trackpoints
Prep:
import pandas as pd
import math
Function to calculate the distance:
# function to approximately calculate the distance between 2 points
# from: http://www.movable-type.co.uk/scripts/latlong.html
def rough_distance(lat1, lon1, lat2, lon2):
    lat1 = lat1 * math.pi / 180.0
    lon1 = lon1 * math.pi / 180.0
    lat2 = lat2 * math.pi / 180.0
    lon2 = lon2 * math.pi / 180.0
    r = 6371.0  # km
    x = (lon2 - lon1) * math.cos((lat1+lat2)/2)
    y = (lat2 - lat1)
    d = math.sqrt(x*x+y*y) * r
    return d
Read data:
df=pd.io.parsers.read_table("trk.tsv",sep="\t")
# drop some columns (for clarity)
df=df.drop(['track','ele','tm_str'],axis=1)
Sample:
df.head()
lat lon
0 50.848408 4.787456
1 50.848476 4.787367
2 50.848572 4.787275
3 50.848675 4.787207
4 50.848728 4.787189
The prior-latitude column is the latitude column shifted by 1 unit:
df['prior_lat']= df['lat'].shift(1)
prior_lat_ix=df.columns.get_loc('prior_lat')
df.iloc[0,prior_lat_ix]= df.lat.iloc[0]
The prior-longitude column is the longitude column shifted by 1 unit:
df['prior_lon']= df['lon'].shift(1)
prior_lon_ix=df.columns.get_loc('prior_lon')
df.iloc[0,prior_lon_ix]= df.lon.iloc[0]
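The same two steps can be written a bit more compactly by filling the NaN of the first shifted row right away; a sketch:
# shift and immediately fill the first row with its own coordinate
df['prior_lat'] = df['lat'].shift(1).fillna(df['lat'].iloc[0])
df['prior_lon'] = df['lon'].shift(1).fillna(df['lon'].iloc[0])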
Calculate the distance:
df['dist']= df[ ['lat','lon','prior_lat','prior_lon'] ].apply(
lambda r : rough_distance ( r[0], r[1], r[2], r[3]) , axis=1)
Calculate the cumulative distance:
cum=0
cum_dist=[]
for d in df['dist']:
    cum=cum+d
    cum_dist.append(cum)
df['cum_dist']=cum_dist
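The loop can also be replaced by the built-in cumulative sum:
# equivalent to the loop above
df['cum_dist'] = df['dist'].cumsum()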
Sample:
df.head()
lat lon prior_lat prior_lon dist cum_dist
0 50.848408 4.787456 50.848408 4.787456 0.000000 0.000000
1 50.848476 4.787367 50.848408 4.787456 0.009831 0.009831
2 50.848572 4.787275 50.848476 4.787367 0.012435 0.022266
3 50.848675 4.787207 50.848572 4.787275 0.012399 0.034665
4 50.848728 4.787189 50.848675 4.787207 0.006067 0.040732
df.tail()
lat lon prior_lat prior_lon dist cum_dist
1012 50.847164 4.788163 50.846962 4.788238 0.023086 14.937470
1013 50.847267 4.788134 50.847164 4.788163 0.011634 14.949104
1014 50.847446 4.788057 50.847267 4.788134 0.020652 14.969756
1015 50.847630 4.787978 50.847446 4.788057 0.021097 14.990853
1016 50.847729 4.787932 50.847630 4.787978 0.011496 15.002349
One-hot encode the categorical data of a dataframe
.. using the pandas get_dummies function.
Data:
import StringIO
import pandas as pd
data_strio=StringIO.StringIO('''category reason species
Decline Genuine 24
Improved Genuine 16
Improved Misclassified 85
Decline Misclassified 41
Decline Taxonomic 2
Improved Taxonomic 7
Decline Unclear 41
Improved Unclear 117''')
df=pd.read_fwf(data_strio)
One-hot encode 'category':
cat_oh= pd.get_dummies(df['category'])
cat_oh.columns= map( lambda x: "cat__"+x.lower(), cat_oh.columns.values)
cat_oh
cat__decline cat__improved
0 1 0
1 0 1
2 0 1
3 1 0
4 1 0
5 0 1
6 1 0
7 0 1
Do the same for 'reason':
reason_oh= pd.get_dummies(df['reason'])
reason_oh.columns= map( lambda x: "rsn__"+x.lower(), reason_oh.columns.values)
Combine
Combine the columns into a new dataframe:
ohdf= pd.concat( [ cat_oh, reason_oh, df['species']], axis=1)
Result:
ohdf
cat__decline cat__improved rsn__genuine rsn__misclassified \
0 1 0 1 0
1 0 1 1 0
2 0 1 0 1
3 1 0 0 1
4 1 0 0 0
5 0 1 0 0
6 1 0 0 0
7 0 1 0 0
rsn__taxonomic rsn__unclear species
0 0 0 24
1 0 0 16
2 0 0 85
3 0 0 41
4 1 0 2
5 1 0 7
6 0 1 41
7 0 1 117
Or, if the 'drop' syntax on the dataframe is more convenient for you:
ohdf= pd.concat( [ cat_oh, reason_oh,
df.drop(['category','reason'], axis=1) ],
axis=1)
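Sidenote: get_dummies can also handle several columns in one call and pass the rest through untouched; the generated names keep the original case (e.g. cat__Decline), so they differ slightly from the manual rename above. A sketch:
# one call: encode both categorical columns, pass 'species' through unchanged
ohdf2 = pd.get_dummies(df, columns=['category', 'reason'], prefix=['cat_', 'rsn_'])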
pandas read_data
20160419
Read a fixed-width datafile inline
import StringIO
import pandas as pd
data_strio=StringIO.StringIO('''category reason species
Decline Genuine 24
Improved Genuine 16
Improved Misclassified 85
Decline Misclassified 41
Decline Taxonomic 2
Improved Taxonomic 7
Decline Unclear 41
Improved Unclear 117''')
Turn the StringIO into a dataframe:
df=pd.read_fwf(data_strio)
Check the content:
df
category reason species
0 Decline Genuine 24
1 Improved Genuine 16
2 Improved Misclassified 85
3 Decline Misclassified 41
4 Decline Taxonomic 2
5 Improved Taxonomic 7
6 Decline Unclear 41
7 Improved Unclear 117
The "5-number" summary
df.describe()
species
count 8.000000
mean 41.625000
std 40.177952
min 2.000000
25% 13.750000
50% 32.500000
75% 52.000000
max 117.000000
Drop a column
df=df.drop('reason',axis=1)
Result:
category species
0 Decline 24
1 Improved 16
2 Improved 85
3 Decline 41
4 Decline 2
5 Improved 7
6 Decline 41
7 Improved 117
delta_time pandas
20151207
Add/subtract a delta time
Problem
A number of photo files were tagged as follows, with the date and the time:
20151205_17h48-img_0098.jpg
20151205_18h20-img_0099.jpg
20151205_18h21-img_0100.jpg
..
Turns out they should all be an hour earlier (reminder: mixing pics from two cameras), so let's create a script to rename these files...
Solution
1. Start
Let's use pandas:
import datetime as dt
import pandas as pd
import re
df0=pd.io.parsers.read_table( '/u01/work/20151205_gran_canaria/fl.txt',sep=",", \
header=None, names= ["fn"])
df=df0[df0['fn'].apply( lambda a: 'img_0' in a )] # filter out certain pics
2. Make parseable
Now add a column to the dataframe that only contains the numbers of the date, so it can be parsed:
df['rawdt']=df['fn'].apply( lambda a: re.sub('-.*.jpg','',a))\
.apply( lambda a: re.sub('[_h]','',a))
Result:
df.head()
fn rawdt
0 20151202_07h17-img_0001.jpg 201512020717
1 20151202_07h17-img_0002.jpg 201512020717
2 20151202_07h17-img_0003.jpg 201512020717
3 20151202_15h29-img_0004.jpg 201512021529
28 20151202_17h59-img_0005.jpg 201512021759
3. Convert to datetime, and subtract delta time
Convert the raw-date to a real date, and subtract an hour:
df['adjdt']=pd.to_datetime( df['rawdt'], format='%Y%m%d%H%M')-dt.timedelta(hours=1)
Note 20190105: apparently you can drop the 'format' string:
df['adjdt']=pd.to_datetime( df['rawdt'])-dt.timedelta(hours=1)
Result:
fn rawdt adjdt
0 20151202_07h17-img_0001.jpg 201512020717 2015-12-02 06:17:00
1 20151202_07h17-img_0002.jpg 201512020717 2015-12-02 06:17:00
2 20151202_07h17-img_0003.jpg 201512020717 2015-12-02 06:17:00
3 20151202_15h29-img_0004.jpg 201512021529 2015-12-02 14:29:00
28 20151202_17h59-img_0005.jpg 201512021759 2015-12-02 16:59:00
4. Convert adjusted date to string
df['adj']=df['adjdt'].apply(lambda a: dt.datetime.strftime(a, "%Y%m%d_%Hh%M") )
We also need the 'stem' of the filename:
df['stem']=df['fn'].apply(lambda a: re.sub('^.*-','',a) )
Result:
df.head()
fn rawdt adjdt \
0 20151202_07h17-img_0001.jpg 201512020717 2015-12-02 06:17:00
1 20151202_07h17-img_0002.jpg 201512020717 2015-12-02 06:17:00
2 20151202_07h17-img_0003.jpg 201512020717 2015-12-02 06:17:00
3 20151202_15h29-img_0004.jpg 201512021529 2015-12-02 14:29:00
28 20151202_17h59-img_0005.jpg 201512021759 2015-12-02 16:59:00
adj stem
0 20151202_06h17 img_0001.jpg
1 20151202_06h17 img_0002.jpg
2 20151202_06h17 img_0003.jpg
3 20151202_14h29 img_0004.jpg
28 20151202_16h59 img_0005.jpg
5. Cleanup
Drop columns that are no longer useful:
df=df.drop(['rawdt','adjdt'], axis=1)
Result:
df.head()
fn adj stem
0 20151202_07h17-img_0001.jpg 20151202_06h17 img_0001.jpg
1 20151202_07h17-img_0002.jpg 20151202_06h17 img_0002.jpg
2 20151202_07h17-img_0003.jpg 20151202_06h17 img_0003.jpg
3 20151202_15h29-img_0004.jpg 20151202_14h29 img_0004.jpg
28 20151202_17h59-img_0005.jpg 20151202_16h59 img_0005.jpg
6. Generate scripts
Generate the 'rename' script:
sh=df.apply( lambda a: 'mv {} {}-{}'.format( a[0],a[1],a[2]), axis=1)
sh.to_csv('rename.sh',header=False, index=False )
Also generate the 'rollback' script (in case we have to rollback the renaming) :
sh=df.apply( lambda a: 'mv {}-{} {}'.format( a[1],a[2],a[0]), axis=1)
sh.to_csv('rollback.sh',header=False, index=False )
First lines of the rename script:
mv 20151202_07h17-img_0001.jpg 20151202_06h17-img_0001.jpg
mv 20151202_07h17-img_0002.jpg 20151202_06h17-img_0002.jpg
mv 20151202_07h17-img_0003.jpg 20151202_06h17-img_0003.jpg
mv 20151202_15h29-img_0004.jpg 20151202_14h29-img_0004.jpg
mv 20151202_17h59-img_0005.jpg 20151202_16h59-img_0005.jpg
pandas dataframe
20150302
Create an empty dataframe
# create
df=pd.DataFrame(np.zeros(0,dtype=[
    ('ProductID', 'i4'),
    ('ProductName', 'a50')
]))

# append
df = df.append({'ProductID':1234, 'ProductName':'Widget'},ignore_index=True)
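Sidenote: DataFrame.append was removed in pandas 2.0; with a recent pandas the append step above becomes a concat. A sketch:
# pandas >= 2.0: append is gone, concatenate a one-row frame instead
df = pd.concat([df, pd.DataFrame([{'ProductID': 1234, 'ProductName': 'Widget'}])],
               ignore_index=True)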
Another way:
columns = ['price', 'item']
df2 = pd.DataFrame(data=np.zeros((0,len(columns))), columns=columns)
Quickies
You want pandas to print more data on your wide terminal window?
pd.set_option('display.line_width', 200)
You want to make the max column width larger?
pd.set_option('max_colwidth',80)
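In more recent pandas versions the options are called 'display.width' and 'display.max_colwidth':
# newer pandas spelling of the same options
pd.set_option('display.width', 200)
pd.set_option('display.max_colwidth', 80)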
datetime pandas numpy
20141025
Dataframe with date-time index
Create a dataframe df with a datetime index and some random values: (note: see 'simpler' dataframe creation further down)
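The original creation snippet isn't shown here; a minimal sketch of what it could look like (the values are random, so they won't match the output below; d1 and d2 are the dates used in the selections further down and are assumed from that output):
import datetime as dt
import numpy as np
import pandas as pd

# a month of daily timestamps starting 2009-12-01, with random values
base = dt.datetime(2009, 12, 1)
idx = [base + dt.timedelta(days=i) for i in range(31)]
df = pd.DataFrame({'value': np.random.randint(50, 100, 31)}, index=idx)

# dates used in the week-selection examples below (assumed from the output)
d1 = dt.datetime(2009, 12, 10)
d2 = dt.datetime(2009, 12, 17)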
Output:
In [4]: df.head(10)
Out[4]:
value
2009-12-01 71
2009-12-02 92
2009-12-03 64
2009-12-04 55
2009-12-05 99
2009-12-06 51
2009-12-07 68
2009-12-08 64
2009-12-09 90
2009-12-10 57
[10 rows x 1 columns]
Now select a week of data
Output: watch out, this selects 8 days (label-based slicing on a datetime index includes both endpoints)!
In [235]: df[d1:d2]
Out[235]:
value
2009-12-10 99
2009-12-11 70
2009-12-12 83
2009-12-13 90
2009-12-14 60
2009-12-15 64
2009-12-16 59
2009-12-17 97
[8 rows x 1 columns]
In [236]: df[d1:d1+dt.timedelta(days=7)]
Out[236]:
value
2009-12-10 99
2009-12-11 70
2009-12-12 83
2009-12-13 90
2009-12-14 60
2009-12-15 64
2009-12-16 59
2009-12-17 97
[8 rows x 1 columns]
In [237]: df[d1:d1+dt.timedelta(weeks=1)]
Out[237]:
value
2009-12-10 99
2009-12-11 70
2009-12-12 83
2009-12-13 90
2009-12-14 60
2009-12-15 64
2009-12-16 59
2009-12-17 97
[8 rows x 1 columns]
Postscriptum: a simpler way of creating the dataframe
An index of a range of dates can also be created like this with pandas:
pd.date_range('20091201', periods=31)
Hence the dataframe:
df=pd.DataFrame(np.random.randint(50,100,31), index=pd.date_range('20091201', periods=31))
pandas dataframe numpy
20141019
Add two dataframes
Add the contents of two dataframes, having the same index
a=pd.DataFrame( np.random.randint(1,10,5), index=['a', 'b', 'c', 'd', 'e'], columns=['val'])
b=pd.DataFrame( np.random.randint(1,10,3), index=['b', 'c', 'e'],columns=['val'])
a
val
a 5
b 7
c 8
d 8
e 1
b
val
b 9
c 2
e 5
a+b
val
a NaN
b 16
c 10
d NaN
e 6
a.add(b,fill_value=0)
val
a 5
b 16
c 10
d 8
e 6
Read/write csv
Read:
pd.read_csv('in.csv')
Write:
<yourdataframe>.to_csv('out.csv',header=False, index=False )
Load a csv file
Load the following csv file. Difficulty: the date is spread over 3 fields.
2014, 8, 5, IBM, BUY, 50,
2014, 10, 9, IBM, SELL, 20 ,
2014, 9, 17, PG, BUY, 10,
2014, 8, 15, PG, SELL, 20 ,
The way I implemented it:
# my way
ls_order_col= [ 'year', 'month', 'day', 'symbol', 'buy_sell', 'number','dummy' ]
df_mo=pd.read_csv(s_filename, sep=',', names=ls_order_col, skipinitialspace=True, index_col=False)

# add column of type datetime
df_mo['date']=pd.to_datetime(df_mo.year*10000+df_mo.month*100+df_mo.day,format='%Y%m%d')

# drop some columns
df_mo.drop(['dummy','year','month','day'], axis=1, inplace=True)

# order by datetime
df_mo.sort(columns='date',inplace=True )
print df_mo
An alternative way is better because the date is converted on reading, and the dataframe is indexed by the date (the original snippet didn't survive; a sketch follows below).
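A minimal sketch of what that could look like (an assumption, not the original code): combine the three columns into one date column while reading, then index by it:
# combine year/month/day into a single 'date' column at read time
ls_order_col = ['year', 'month', 'day', 'symbol', 'buy_sell', 'number', 'dummy']
df_mo = pd.read_csv(s_filename, sep=',', names=ls_order_col, skipinitialspace=True,
                    parse_dates={'date': ['year', 'month', 'day']})
# drop the trailing dummy column, index by the parsed date and sort
df_mo = df_mo.drop('dummy', axis=1).set_index('date').sort_index()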