aardvark.code
 
07_timing
20160608

Timing: compare execution of two similar processes

What?

Wise.io created paratext to speed up CSV parsing. See this article or github. Here is a comparison of paratext and pandas loading a big CSV file: there definitely is a difference in performance, though not a dramatic one. See the following chart: paratext in blue, pandas in red. The y-axis is the elapsed time in seconds (lower is better). The x-axis is the number of lines read, from 1,000 to 30 million.

This test was executed on a Xeon E5-2660 CPU @ 2.20GHz (8 cores) with 61 GB of memory available. The data file loaded was 'train.csv' from Kaggle's Expedia Hotel competition.
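The aardvark.sh script further down times each whole process with /usr/bin/time, so every measurement also includes interpreter startup and library imports. If you only want the cost of the load call itself, something like the following sketch could be used. It is not part of the original aardvark.code: it assumes Python 3 (for time.perf_counter), and the names time_call and csv_module_loader are made up for illustration.

```python
import csv
import time


def time_call(loader, path, repeats=3):
    """Return the best wall-clock time in seconds of loader(path) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        loader(path)
        best = min(best, time.perf_counter() - start)
    return best


def csv_module_loader(path):
    """Stand-in loader that only needs the standard library (hypothetical helper)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))
```

With pandas and paratext installed, time_call(pd.read_csv, "sample.csv") and time_call(paratext.load_csv_to_pandas, "sample.csv") would give the two numbers to compare. Taking the best of a few runs is one way to dampen noise from disk caching; it is a choice made here, not something the original test does.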

Prerequisite

To run this aardvark.code example you need the following software installed on your system:

  • python and pandas
  • paratext (see above github link on how to install)
  • R (for plotting)
  • aardvark.

Go aardvark

Grab this aardvark.code file:

wget http://data.munging.ninja/aardvarkcode/timing/aardvark.code

Execute

Execute aardvark. Wait a while. Then admire the chart:

$ display chart.png

The aardvark.code

##================================================================================
##== pandas_load.py 
import pandas as pd
# read_csv is the public equivalent of read_table(..., sep=',')
df = pd.read_csv("sample.csv")

##================================================================================
##== para_load.py 
import pandas as pd
import paratext
df = paratext.load_csv_to_pandas('sample.csv')

##================================================================================
##== plot.R

png('chart.png',width=800, height=400) 
df<-read.table('timing.csv', sep='|', header=F)  

x=df[df$V1=='pandas_load.py',c('V2')]
y1=df[df$V1=='pandas_load.py',c('V3')]
y2=df[df$V1=='para_load.py',c('V3')]

plot(x,y1,type='b',pch=19,col='red', ylim=range(c(y1,y2)), main="Load CSV: Pandas vs Paratext", xlab="numlines", ylab="elapsed time (s)")
lines(x,y2,type='b',pch=19,col='blue')
dev.off()


##================================================================================
##== aardvark.sh  
#!/bin/bash
PY_EXE="/usr/bin/python2"

rm -f timing.csv
for N in 1000 10000 25000 50000 75000 100000 250000 500000 750000 1000000 2500000 \
         5000000 7500000 10000000 15000000 20000000 25000000 30000000 
do
    head -n "$N" train.csv > sample.csv
    for PYSCRIPT in pandas_load.py para_load.py 
    do
       /usr/bin/time -f "$PYSCRIPT|$N|%e|%U|%S" $PY_EXE $PYSCRIPT  2>> timing.csv
    done
done 

# plot the result
/usr/bin/R --slave --vanilla --quiet -f ./plot.R

Closing

The timing.csv data file produced by the script and consumed by plot.R looks like this (fields: script, number of lines, then elapsed, user and system time in seconds):

pandas_load.py|1000|0.38|0.24|0.12
para_load.py|1000|0.39|0.28|0.10
pandas_load.py|10000|0.41|0.26|0.14
para_load.py|10000|0.40|0.30|0.12
pandas_load.py|25000|0.66|0.43|0.22
para_load.py|25000|0.49|0.41|0.13
pandas_load.py|50000|0.53|0.39|0.12
para_load.py|50000|0.47|0.43|0.15
pandas_load.py|75000|0.59|0.40|0.18
para_load.py|75000|0.52|0.55|0.13
pandas_load.py|100000|0.67|0.52|0.14
..
..
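To read the numbers without the chart, the rows above can be turned into a pandas-to-paratext time ratio per sample size. The following is a minimal sketch, not part of the original aardvark.code: the function names parse_timing and speedups are made up for illustration, and it only uses the Python standard library.

```python
import csv
from collections import defaultdict


def parse_timing(lines):
    """Map numlines -> {script: elapsed seconds} from 'script|numlines|elapsed|user|sys' rows."""
    table = defaultdict(dict)
    for row in csv.reader(lines, delimiter="|"):
        if len(row) != 5:
            continue  # skip blank lines and elisions such as '..'
        script, numlines, elapsed = row[0], int(row[1]), float(row[2])
        table[numlines][script] = elapsed
    return dict(table)


def speedups(table, baseline="pandas_load.py", candidate="para_load.py"):
    """Return numlines -> baseline/candidate elapsed ratio, where both scripts were timed."""
    return {
        n: times[baseline] / times[candidate]
        for n, times in sorted(table.items())
        if baseline in times and candidate in times
    }


# The sample rows shown above, copied verbatim.
SAMPLE_ROWS = """\
pandas_load.py|1000|0.38|0.24|0.12
para_load.py|1000|0.39|0.28|0.10
pandas_load.py|10000|0.41|0.26|0.14
para_load.py|10000|0.40|0.30|0.12
pandas_load.py|25000|0.66|0.43|0.22
para_load.py|25000|0.49|0.41|0.13
pandas_load.py|50000|0.53|0.39|0.12
para_load.py|50000|0.47|0.43|0.15
pandas_load.py|75000|0.59|0.40|0.18
para_load.py|75000|0.52|0.55|0.13
pandas_load.py|100000|0.67|0.52|0.14
""".splitlines()
```

Calling speedups(parse_timing(SAMPLE_ROWS)) gives one ratio per sample size that has both measurements; a value above 1 means paratext was faster. For the 25000-line sample that is 0.66 / 0.49, roughly 1.35. On a full run, pass open("timing.csv") instead of SAMPLE_ROWS.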
 
Notes by Data Munging Ninja. Generated on akalumba:sync/20151223_datamungingninja/aardvarkcode at 2018-02-24 12:57