Timing: compare execution of two similar processes
What?
Wise.io have created paratext to speed up CSV parsing. See this article or github. Here is a comparison of paratext and pandas loading a big CSV file: there definitely is a difference in performance, though not a dramatic one. See the following chart: paratext in blue, pandas in red. The Y-axis is the elapsed time in seconds (lower is better). The X-axis is the number of lines read, from 1000 to 30 million.
This test was executed on a Xeon CPU E5-2660 @ 2.20GHz (8 cores) with 61 GB of memory available. The data file loaded was the 'train.csv' of Kaggle's Expedia Hotel competition.
Prerequisite
To run this aardvark.code example you need to have the following software installed on your system:
- python and pandas
- paratext (see above github link on how to install)
- R (for plotting)
- aardvark.
Go aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/timing/aardvark.code
Execute
Execute aardvark. Wait a while. Then admire the chart:
$ display chart.png
The aardvark.code
##================================================================================
##== pandas_load.py
import pandas as pd
# read_csv is the public entry point; the comma separator is its default
df = pd.read_csv("sample.csv")
##================================================================================
##== para_load.py
import pandas as pd
import paratext
df = paratext.load_csv_to_pandas('sample.csv')
##================================================================================
##== plot.R
png('chart.png',width=800, height=400)
df<-read.table('timing.csv', sep='|', header=F)
x=df[df$V1=='pandas_load.py',c('V2')]
y1=df[df$V1=='pandas_load.py',c('V3')]
y2=df[df$V1=='para_load.py',c('V3')]
plot(x,y1,type='b',pch=19,col='red', main="Load CSV: Pandas vs Paratext", xlab="numlines", ylab="time")
lines(x,y2,type='b',pch=19,col='blue')
dev.off()
##================================================================================
##== aardvark.sh
#!/bin/bash
PY_EXE="/usr/bin/python2"
# -f: do not fail on the first run, when timing.csv does not exist yet
rm -f timing.csv
for N in 1000 10000 25000 50000 75000 100000 250000 500000 750000 1000000 2500000 \
5000000 7500000 10000000 15000000 20000000 25000000 30000000
do
head -n "$N" train.csv > sample.csv
for PYSCRIPT in pandas_load.py para_load.py
do
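# GNU time: -o/-a append the timing line to timing.csv, keeping the scripts' own stderr out of it
/usr/bin/time -a -o timing.csv -f "$PYSCRIPT|$N|%e|%U|%S" "$PY_EXE" "$PYSCRIPT"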
done
done
# plot the result
/usr/bin/R --slave --vanilla --quiet -f ./plot.R
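The aardvark.sh part relies on GNU `/usr/bin/time` to measure each script. On a system where that tool is not available, roughly the same three numbers can be collected from Python itself. The following is only a minimal sketch under that assumption, not part of the original aardvark.code: `time_command` and `timing_line` are hypothetical helper names, and the child CPU times from `os.times()` are reported as zero on Windows.

```python
import os
import subprocess
import sys
import time


def time_command(cmd):
    """Run cmd to completion and return (elapsed, user, system) in seconds.

    Hypothetical helper: elapsed is wall-clock time, user/system are the
    CPU times accumulated by child processes while cmd was running.
    """
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return elapsed, user, system


def timing_line(script, num_lines, python_exe=sys.executable):
    """Format one result in the same pipe-separated layout as timing.csv."""
    elapsed, user, system = time_command([python_exe, script])
    return f"{script}|{num_lines}|{elapsed:.2f}|{user:.2f}|{system:.2f}"


if __name__ == "__main__":
    # Time a trivial child process as a smoke test.
    print(time_command([sys.executable, "-c", "pass"]))
```

Calling `timing_line("pandas_load.py", 1000)` after creating sample.csv would yield one line in the format that plot.R expects.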
Closing
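The timing.csv data file produced by the script, and consumed by the plot.R code, looks like the listing below. Each line holds the script name, the number of lines read, and the elapsed, user CPU and system CPU times in seconds: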
pandas_load.py|1000|0.38|0.24|0.12
para_load.py|1000|0.39|0.28|0.10
pandas_load.py|10000|0.41|0.26|0.14
para_load.py|10000|0.40|0.30|0.12
pandas_load.py|25000|0.66|0.43|0.22
para_load.py|25000|0.49|0.41|0.13
pandas_load.py|50000|0.53|0.39|0.12
para_load.py|50000|0.47|0.43|0.15
pandas_load.py|75000|0.59|0.40|0.18
para_load.py|75000|0.52|0.55|0.13
pandas_load.py|100000|0.67|0.52|0.14
..
..
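To put a number on the difference rather than reading it off the chart, the elapsed times can be turned into a pandas/paratext ratio per sample size. The following is a small sketch assuming the five-column, pipe-separated layout shown above; `read_elapsed` and `speedups` are hypothetical helper names, and the sample rows are copied from the listing.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the timing.csv listing above.
SAMPLE = """\
pandas_load.py|1000|0.38|0.24|0.12
para_load.py|1000|0.39|0.28|0.10
pandas_load.py|25000|0.66|0.43|0.22
para_load.py|25000|0.49|0.41|0.13
"""


def read_elapsed(handle):
    """Return {num_lines: {script: elapsed_seconds}} from pipe-separated rows."""
    elapsed = defaultdict(dict)
    for row in csv.reader(handle, delimiter="|"):
        if len(row) != 5:
            continue  # skip rows that do not have the five expected fields
        script, num_lines, real = row[0], int(row[1]), float(row[2])
        elapsed[num_lines][script] = real
    return dict(elapsed)


def speedups(elapsed, baseline="pandas_load.py", candidate="para_load.py"):
    """Return {num_lines: baseline_time / candidate_time} where both exist."""
    return {
        n: times[baseline] / times[candidate]
        for n, times in sorted(elapsed.items())
        if baseline in times and candidate in times and times[candidate] > 0
    }


if __name__ == "__main__":
    for n, ratio in speedups(read_elapsed(io.StringIO(SAMPLE))).items():
        print(f"{n:>10} lines: pandas/paratext = {ratio:.2f}")
```

A ratio above 1 means paratext finished faster than pandas for that sample size. To run it on real results, pass `open("timing.csv")` instead of the in-memory sample.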