TL;DR
Aardvark is about putting a bunch of code files together in one file, and executing one command to do everything necessary (compile, execute, launch on a cluster, ship across the network, ...) to produce the desired output.
Stop being a manager of files; concentrate on writing code!
Detail
The 'itch' that led me to the 'scratch' was the number of different development environments I had to open up to run relatively simple code on a Spark cluster, use Python to copy the results from HDFS, and then use R to plot a nice chart from it. Noticing that switching between contexts was not helping my concentration or focus, I decided to put all the code together in one big code file, and use a utility (i.e. aardvark) to split it up into smaller files.
So I 'cat'-ted all my source files into one file named aardvark.code, each entry separated by a line containing
##== <filename>
and ran aardvark on it.
NOTE: on Wed 1st Jun, the split pattern was changed from ##-- to ##== !
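The splitting mechanism itself is easy to picture. Here is a hypothetical Python sketch of the idea (the real tool is written in Go; this illustration also skips the special treatment of '$'-prefixed names described in the sqlite example further down):

```python
import re

def split_aardvark(path):
    """Split a combined code file on '##== <filename>' marker lines,
    writing each chunk to the file named on its marker."""
    current, chunks = None, {}
    with open(path) as f:
        for line in f:
            if line.startswith("##=="):
                m = re.match(r"##==\s+(\S+)", line)
                if m:  # a marker naming a file: start a new chunk
                    current = m.group(1)
                    chunks[current] = []
                # bare '##====...' ruler lines are simply skipped
            elif current is not None:
                chunks[current].append(line)
    for name, lines in chunks.items():
        with open(name, "w") as out:
            out.writelines(lines)
    return sorted(chunks)
```

Run on the hello-world example below, such a splitter would write out generate_data.R, capitalize.py and aardvark.sh.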
Now the number of windows opened in my development environment was reduced to just two terminal windows: one running vi to edit the code, and the other one to kick off the 'aardvark' command.
In this one file, aardvark.code, you can also add your data, documentation, etc. It's like taking the object-oriented concept of encapsulation a step further: encapsulate not only the code, but also the data, documentation, and compilation/execution/test instructions in one single file.
Another idea is to put what is important, or what changes a lot, close to the top, like a 'TL;DR' section or an executive summary.
Advantages
- less file management chores to do, concentrate on writing code
- less context switching (from Eclipse to RStudio to IDLE to the command line to Firefox to ...) is good for the brain (google 'nytimes unitasking distraction': doing less but getting more done)
- use the best bits of each world: a db for data manipulation, Python as the Swiss-army knife, R for plotting charts, markdown for documentation, csv for data, golang for speed, ..., and glue it all together using aardvark
- reduce your mouse usage, keep those fingers on the keyboard, and concentrate on the code
- no need to wonder what script needs to be executed for kicking off the code in that directory: by convention it is always aardvark.sh
- easy to work on a new (or refactored) version of a project while keeping the old one ready for execution and testing (just have 2 aardvark files, e.g. aardvark.code and old.code)
Writing code the 'aardvark' style is ideal for a collection of relatively short scripts that have disparate execution environments and can all be executed from the command line.
Install aardvark
Go get style
With your $GOPATH variable properly set:
$ go get github.com/dtmngngnnj/aardvark
$ cp $GOPATH/bin/aardvark ~/bin
Manually
Grab a copy of aardvark.go from github, compile it, and copy the resulting executable to your bin directory:
$ wget https://github.com/dtmngngnnj/aardvark/raw/master/aardvark.go
$ go build aardvark.go
$ cp aardvark ~/bin
20160515
Simple example: Hello World
What?
In the following example aardvark extracts 3 files from the aardvark.code file: an R script, a Python script and a bash script. Then aardvark.sh (the bash script) is executed. The R script emits 'hello world', which gets capitalized by the Python script.
Prerequisite
For this example you need to have the following software installed on your computer: R, Python 2 and aardvark.
In a newly created directory (aka folder), put the aardvark.code file, which you can grab like this:
wget http://data.munging.ninja/aardvarkcode/simple_example/aardvark.code
Here's the content:
##================================================================================
##== generate_data.R
cat ("hello world!")
##================================================================================
##== capitalize.py
#!/usr/bin/python
import sys
import string
for line in sys.stdin:
    print string.capwords(line)
##================================================================================
##== aardvark.sh
#!/bin/bash
R_EXE="/usr/bin/R --slave --vanilla --quiet"
PY_EXE="/usr/bin/python2"
$R_EXE -f ./generate_data.R | $PY_EXE ./capitalize.py
Execute aardvark. After some housekeeping messages, you'll see:
$ aardvark
..
..
Hello World!
Sqlite example
What?
In the following example aardvark stores one entry's content in the tag dictionary, and extracts 3 files from the aardvark.code file:
- the create_load_exec.sql script
- the aardvark.sh script (which gets auto-executed when aardvark finishes writing the files)
- the data.csv file
The executed script pumps data into an SQLite database and runs a SQL query on it.
Prerequisite
For this example you need to have the following software installed on your computer: sqlite3 and aardvark.
Go aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/sqlite/aardvark.code
The $key and value
Look at the code: when aardvark finds a 'filename' that starts with a dollar sign '$' (e.g. $sql), it is not considered a file but a key/value pair. The value (content) is stored in a dictionary under the key (e.g. '$sql'). Further down in the aardvark.code file, this content is pulled into a script via the identifier '[[$sql]]'.
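The substitution step can be sketched in a few lines of Python; this is a hypothetical illustration of the idea, not the actual Go implementation:

```python
import re

def expand_tags(text, tags):
    """Replace every [[$key]] placeholder in 'text' with the content
    stored under that key in the tag dictionary."""
    return re.sub(r"\[\[(\$\w+)\]\]", lambda m: tags[m.group(1)], text)
```

With tags = {'$sql': 'select ...'} this would pull the query into the create_load_exec.sql template exactly where [[$sql]] appears.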
Execute
Execute aardvark. After some housekeeping messages, you'll see the result from the query:
$ aardvark
..
..
TNM SCRM Magallanes & Antártica
SMB SCSB Magallanes & Antártica
WPR SCFM Magallanes & Antártica
PNT SCNT Magallanes & Antártica
WPU SCGZ Magallanes & Antártica
PUQ SCCI Magallanes & Antártica
The aardvark.code file
##== $sql =====================================================================
select iata,airport_name,region
from t_airport
where region like '%ca' ;
##=============================================================================
##== create_load_exec.sql -----------------------------------------------------
create table t_airport (
iata varchar(8)
,icao varchar(8)
,city_served varchar(64)
,region varchar(64)
,airport_name varchar(128)
);
.mode csv
.import data.csv t_airport
.mode tabs
[[$sql]]
##=============================================================================
##== aardvark.sh --------------------------------------------------------------
#!/bin/bash
rm -f test.db
cat create_load_exec.sql | sqlite3 test.db
##=============================================================================
##== data.csv -----------------------------------------------------------------
WAP,SCAP,Alto Palena,Los Lagos,SCAP
ZUD,SCAC,Ancud,Los Lagos,SCAC
TNM,SCRM,Antarctica,Magallanes & Antártica,SCRM
ANF,SCFA,Antofagasta,Antofagasta,SCFA
ARI,SCAR,Arica,Arica & Parinacota,SCAR
BBA,SCBA,Balmaceda,Aisén,SCBA
CJC,SCCF,Calama,Antofagasta,SCCF
WCA,SCST,Castro,Los Lagos,SCST
SMB,SCSB,Cerro Sombrero,Magallanes & Antártica,SCSB
WCH,SCTN,Chaitén,Los Lagos,SCTN
CNR,SCRA,Chañaral,Atacama,SCRA
CCH,SCCC,Chile Chico,Aisén,SCCC
YAI,SCCH,Chillán,Biobío,SCCH
GXQ,SCCY,Coihaique,Aisén,SCCY
LGR,SCHR,Cochrane,Aisén,SCHR
CCP,SCIE,Concepción,Biobío,SCIE
CPO,SCHA,Copiapó,Atacama,SCHA
COW,SCQB,Coquimbo,Coquimbo,SCQB
ZCQ,SCIC,Curicó,Maule,SCIC
ESR,SCES,El Salvador,Atacama,SCES
FFU,SCFT,Futaleufú,Los Lagos,SCFT
IQQ,SCDA,Iquique,Tarapacá,SCDA
IPC,SCIP,Isla de Pascua,Valparaíso,SCIP
LSC,SCSE,La Serena,Coquimbo,SCSE
ZLR,SCLN,Linares,Maule,SCLN
LOB,SCAN,Los Andes,Valparaíso,SCAN
LSQ,SCAG,Los Ángeles,Biobío,SCAG
ZOS,SCJO,Osorno,Los Lagos,SCJO
OVL,SCOV,Ovalle,Coquimbo,SCOV
WPR,SCFM,Porvenir,Magallanes & Antártica,SCFM
ZPC,SCPC,Pucón,Araucanía,SCPC
WPA,SCAS,Puerto Aisén,Aisén,SCAS
PMC,SCTE,Puerto Montt,Los Lagos,SCTE
PNT,SCNT,Puerto Natales,Magallanes & Antártica,SCNT
WPU,SCGZ,Puerto Williams,Magallanes & Antártica,SCGZ
PUQ,SCCI,Punta Arenas,Magallanes & Antártica,SCCI
QRC,SCRG,Rancagua,O'Higgins,SCRG
SSD,SCSF,San Felipe,Valparaíso,SCSF
SCL,SCEL,Santiago,Santiago Metropolitan,SCEL
ULC,SCTI,Santiago,Santiago Metropolitan,SCTI
TLX,SCTL,Talca,Maule,SCTL
ZCO,SCTC,Temuco,Araucanía,SCTC
TOQ,SCBE,Tocopilla,Antofagasta,SCBE
ZAL,SCVD,Valdivia,Los Ríos,SCVD
VLR,SCLL,Vallenar,Atacama,SCLL
VAP,SCVA,Valparaíso,Valparaíso,SCVA
KNA,SCVM,Viña del Mar - Concón,Valparaíso,SCVM
Postgresql / Java / R example
What?
Java is used to run a SQL query on a Postgres database. The result is loaded into a data.frame in R, and a pie chart is plotted. Note: the figures are sums of city populations, not whole-country populations, so be careful before you broadcast these 'facts'!
Prerequisite
For this example you need to have the following software installed on your computer:
- a postgres db running with the city data inserted into t_city, as described on page Load City Data (click on the 'sql' tab)
- the postgres jdbc jar file
- java jdk
- R
- aardvark
Go aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/sqljavar/aardvark.code
Look at the following code: the lines of SQL code will get inserted into the Java code (via the tag dictionary). The Java program prints out the resultset, which is piped into a file; R reads that file and produces a colorful pie chart from it.
Execute
Execute aardvark. Have a look at the produced pie chart:
$ display pie.png
The aardvark.code
##================================================================================
##== $sqlquery_java =============================================================
" select country, sum(population) as sum_pop "+
" from t_city "+
" where country in ('AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR',"+
" 'GB', 'GR', 'HR', 'HU', 'IE', 'IT', 'LT', 'LU', 'LV', 'MT', 'NL',"+
" 'PL', 'PT', 'RO', 'SE', 'SI', 'SK', 'AN')"+
" group by country"+
" order by 2 desc;"
##================================================================================
##== $print_result_java =========================================================
System.out.printf("%s\t%d\n", rs.getString(1), rs.getInt(2));
##================================================================================
##== plot.R ======================================================================
topn=7 # experiment: eg. change to top-10 or top-5
df<-read.table('result.csv',sep="\t",header=F)
colnames(df)=c("country", "population")
pd=rbind( df[1:topn,], data.frame(country="Rest",
population=sum(df[(topn+1):nrow(df),"population"])) )
pct <- round(pd$population/sum(pd$population)*100)
pd$label=paste(pd$country," (",pct,"%)",sep="")
#x11(width=800, height=300)
png('pie.png',width=800, height=400)
par(mfrow = c(1, 2))
pie(pd$population,labels=pd$label,main="EU population before Brexit",
col=rainbow(nrow(pd)))
# drop GB
df<-df[ df$country!='GB',]
pd=rbind( df[1:topn,], data.frame(country="Rest",
population=sum(df[(topn+1):nrow(df),"population"])) )
pct <- round(pd$population/sum(pd$population)*100)
pd$label=paste(pd$country," (",pct,"%)",sep="")
pie(pd$population,labels=pd$label,main="EU population after Brexit",
col=rainbow(nrow(pd)))
dev.off()
##================================================================================
##== dirty/query/Query.java ======================================================
package query;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class Query {
    public static void main( String args[]) {
        Connection con = null;
        Statement stmt = null;
        try {
            Class.forName("org.postgresql.Driver");
            con = DriverManager.getConnection("jdbc:postgresql://172.16.1.43:5432/dmn",
                    "dmn", "dmn");
            con.setAutoCommit(false);
            stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery(
                [[$sqlquery_java]]
            );
            while ( rs.next() ) {
                [[$print_result_java]]
            }
            rs.close();
            stmt.close();
            con.close();
        } catch ( Exception e ) {
            System.err.println( e.getClass().getName()+": "+ e.getMessage() );
            System.exit(0);
        }
    }
}
##================================================================================
##== aardvark.sh =================================================================
#!/bin/bash
export DB_RESULT="result.csv"
# --------------------------------------------------------------
# Part 1: compile the java file, and run it (conditionally)
export POSTGRESJDBC="/opt/jdbc/postgres/postgresql-9.4.1208.jar"
S="query/Query.java"
T=${S%.java}.class
E=${S%.java}
# compile: but only if the java code is younger than the class
S_AGE=`stat -c %Y "dirty/"$S`
T_AGE=`stat -c %Y "dirty"/$T`
if [ -z $T_AGE ] || [ $T_AGE -le $S_AGE ]
then
echo "## Compiling"
(cd dirty; javac $S)
fi
# check if class file was produced
if [ ! -e "dirty/"$T ]
then
echo "## '$T' doesn't exist, can't run it."
exit 1
fi
# execute
echo "Fetching data from DB"
java -cp $POSTGRESJDBC:dirty $E $* > $DB_RESULT
# --------------------------------------------------------------
# Part 2: kick off R
echo "Plotting"
R_EXE="/usr/bin/R --slave --vanilla --quiet"
$R_EXE -f ./plot.R
Spark example
What?
This example is about running a Spark Scala job on the cluster.
The same NYC taxi data is used as was described in the article 'All Cylinders', but now to calculate the average tip per ride per weekday. Also see that article for more information about the 'build.sbt' and 'assembly.sbt' files.
The final barchart produced looks like this:
Prerequisite
The Scala build tool ('sbt') has been installed on your system, as well as Spark, a Hadoop client (for accessing HDFS), and finally R for plotting. And of course aardvark.
Go aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/spark/aardvark.code
The two most important files are on top:
- the scala query
- the plotting of the chart using R
Execute
Execute aardvark. Go for coffee. Come back. Have a look at the chart:
$ display barchart.png
The aardvark.code
##========================================================================
##== $query_scala
// input is 'in_rdd', output is 'out_rdd'
// for every line in the file (except the header), split it into fields,
// and 'emit' a tuple containing
// key: day-of-week, (prepended with number for sorting eg. "3-WED")
// value: (1, tip_amount)
val ride=in_rdd.filter( !_.startsWith("VendorID") ).
map( line => {
val spl=line.split(",")
val dateFmt= DateTimeFormatter.ofPattern("yyyy-MM-dd")
val dt=LocalDate.parse( spl(1).substring(0,10), dateFmt)
val dows=dt.getDayOfWeek().toString().substring(0,3)
val down=dt.getDayOfWeek().getValue()
( s"$down-$dows", (1, spl(15).toDouble) )
})
// sum up, per day-of-week
val tuple=ride.reduceByKey( (a,b) => (a._1+b._1, a._2+b._2))
// output: divide tips by num-rides, to get average
val out_rdd=tuple.map( r => {
val (k,v)=(r._1,r._2)
if (v._1!=0) (k, v._2/v._1.toDouble)
else (k, 0)
} )
##========================================================================
##== plot.R
png('barchart.png',width=800, height=400)
df<-read.table('output.txt', sep=',', header=F)
names(df)<-c("dow","val")
dfo=df[order(df$dow),]
dfo$dow=sub('^..','',dfo$dow)
barplot( dfo$val, names.arg=dfo$dow,
main="Average tip per ride",sub="2015" )
dev.off()
##========================================================================
##== Taxi.scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.log4j.Logger
import java.time.LocalDate
import java.time.format.DateTimeFormatter
object Taxi {
def main(arg: Array[String]) {
var logger = Logger.getLogger(this.getClass())
// Arguments
if (arg.length < 2) {
logger.error("No input/output path!")
System.err.println("No input/output path!")
System.exit(1)
}
val inpath = arg(0)
val outpath = arg(1)
// setup sparkcontext
val jobname = "Taxi"
val conf = new SparkConf().setAppName(jobname)
val sc = new SparkContext(conf)
logger.info(s"Job=${jobname} Inpath=${inpath} Outpath=${outpath} " )
val in_rdd=sc.textFile(inpath) // the taxi file
[[$query_scala]]
out_rdd.saveAsTextFile(outpath)
}
}
##========================================================================
##== build.sbt
mainClass in assembly := Some("Taxi")
jarName in assembly := "taxi.jar"
lazy val root = (project in file(".")).
settings(
name := "taxi",
version := "1.0"
)
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
)
##========================================================================
##== project/assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
##========================================================================
##== aardvark.sh
#!/bin/bash
# *********************************************************
# *** PART 0: checks before running ***********************
if [ -z $HADOOP_HOME ]; then
echo "Variable 'HADOOP_HOME' is not set!"
exit 1
fi
if [ -z $SPARK_HOME ]; then
echo "Variable 'SPARK_HOME' is not set!"
exit 1
fi
# *********************************************************
# *** PART 1: assemble the jar file ***********************
# compare age of source (scala file) and target (jar file)
S_DATE=`stat -c %Y Taxi.scala`
T_DATE=0
JARFILE=`ls target/scala*/taxi.jar`
if [ ! -z $JARFILE ]
then
T_DATE=`stat -c %Y $JARFILE`
fi
if [ $T_DATE -le $S_DATE ]
then
echo "*** sbt assembly ***"
echo "(if this is the first run, go for a coffee break)"
sbt assembly
fi
# *********************************************************
# *** PART 2: launch jar on the spark cluster *************
# condition 1: the jarfile should exist
JARFILE=`ls target/scala*/taxi.jar`
if [ ! -f $JARFILE ]
then
echo "'$JARFILE' doesn't exist, can't run it."
exit 1
fi
# condition 2: the jar file should be younger than
# the scala sourcefile
S_DATE=`stat -c %Y Taxi.scala`
T_DATE=`stat -c %Y $JARFILE`
if [ $T_DATE -le $S_DATE ]
then
echo "'$JARFILE' is older than source, not running"
exit 1
fi
# define job input/output paths
OUTPUT_PATH=hdfs:///user/wildadm/tip_per_ride
INPUT_PATH=hdfs:///user/wildadm/20160421_nyc_taxi
#INPUT_PATH=hdfs:///user/wildadm/20160421_nyc_taxi_subset
# PRE-LAUNCH: delete the output directory
$HADOOP_HOME/bin/hdfs dfs -rm -r tip_per_ride
# LAUNCH
$SPARK_HOME/bin/spark-submit --master yarn-cluster \
--num-executors 12 \
target/scala-2.10/taxi.jar \
$INPUT_PATH $OUTPUT_PATH
# *********************************************************
# *** PART 3: post-run, fetch data from hdfs **************
$HADOOP_HOME/bin/hdfs dfs -cat $OUTPUT_PATH/part* |\
sed -e 's/^(//' -e 's/)$//' > output.txt
# *********************************************************
# *** PART 4: plot the output *****************************
/usr/bin/R --slave --vanilla --quiet -f ./plot.R
# *********************************************************
# *** THE END *********************************************
echo "Done!"
Timing: compare execution of two similar processes
What?
Wise.io have created paratext to bump up the speed of CSV parsing. See this article or github. Here is a comparison of paratext and pandas loading a big CSV file: there definitely is a difference in performance, though not a dramatic one. See the following chart: paratext in blue, pandas in red. The Y-axis is the time taken (lower is better); the X-axis is the number of lines read, from 1000 to 30 million.
This test was executed on a Xeon CPU E5-2660 @ 2.20GHz (8 cores) with 61 GB of memory available. The data file loaded was the 'train.csv' of Kaggle's Expedia Hotel competition.
Prerequisite
To run this aardvark.code example you need to have the following software installed on your system:
- python and pandas
- paratext (see above github link on how to install)
- R (for plotting)
- aardvark.
Go aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/timing/aardvark.code
Execute
Execute aardvark. Wait a while. Then admire the chart:
$ display chart.png
The aardvark.code
##================================================================================
##== pandas_load.py
import pandas as pd
df=pd.io.parsers.read_table("sample.csv",sep=',')
##================================================================================
##== para_load.py
import pandas as pd
import paratext
df = paratext.load_csv_to_pandas('sample.csv')
##================================================================================
##== plot.R
png('chart.png',width=800, height=400)
df<-read.table('timing.csv', sep='|', header=F)
x=df[df$V1=='pandas_load.py',c('V2')]
y1=df[df$V1=='pandas_load.py',c('V3')]
y2=df[df$V1=='para_load.py',c('V3')]
plot(x,y1,type='b',pch=19,col='red', main="Load CSV: Pandas vs Paratext", xlab="numlines", ylab="time")
lines(x,y2,type='b',pch=19,col='blue')
dev.off()
##================================================================================
##== aardvark.sh
#!/bin/bash
PY_EXE="/usr/bin/python2"
rm -f timing.csv
for N in 1000 10000 25000 50000 75000 100000 250000 500000 750000 1000000 2500000 \
5000000 7500000 10000000 15000000 20000000 25000000 30000000
do
head -$N train.csv > sample.csv
for PYSCRIPT in pandas_load.py para_load.py
do
/usr/bin/time -f "$PYSCRIPT|$N|%e|%U|%S" $PY_EXE $PYSCRIPT 2>> timing.csv
done
done
# plot the result
/usr/bin/R --slave --vanilla --quiet -f ./plot.R
Closing
The timing.csv data file produced by the script, and consumed by the plot.R code, looks like this:
pandas_load.py|1000|0.38|0.24|0.12
para_load.py|1000|0.39|0.28|0.10
pandas_load.py|10000|0.41|0.26|0.14
para_load.py|10000|0.40|0.30|0.12
pandas_load.py|25000|0.66|0.43|0.22
para_load.py|25000|0.49|0.41|0.13
pandas_load.py|50000|0.53|0.39|0.12
para_load.py|50000|0.47|0.43|0.15
pandas_load.py|75000|0.59|0.40|0.18
para_load.py|75000|0.52|0.55|0.13
pandas_load.py|100000|0.67|0.52|0.14
..
..
Frequently asked questions
Why name it 'aardvark'?
The first proper word on page one of the dictionary comes first! That's one aspect of what aardvark is about: putting first what's most important. Also: it will show up as one of the first files when doing 'ls -l' ...
What is a filename that starts with a dollar sign?
It's a key: its content will not be written to a file, but stored in the tag dictionary. See the sqlite example.
What is this [[$sql]] notation?
In your code you can pull in the text of previously stored tags. See the previous question and also the sqlite example.
What is the difference between aardvark, aardvark.code and aardvark.sh?
- aardvark is the utility that does the splitting
- aardvark.code is your file containing R, python, java, scala, ... code
- aardvark.sh is the script that is executed by aardvark after splitting the aardvark.code file. It's best to include aardvark.sh as an entry in your aardvark.code file.
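As a minimal (hypothetical) illustration of how the three relate, an aardvark.code could look like this:

```
##== hello.py ================================================================
print("hello from aardvark")
##== aardvark.sh =============================================================
#!/bin/bash
python hello.py
```

Running the aardvark utility in that directory writes hello.py and aardvark.sh, then executes aardvark.sh.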
Best tool for the job
What?
Instead of using a Java CSV library, use Python pandas to preprocess a complex CSV file (i.e. one with embedded commas), write it out as tab-separated fields, and drop some columns while we are at it.
Then use Java to read the easily splittable TSV file, and perform aggregation on it using Java 8 streams.
Detail about the java8 Aggregation
Read the data in streaming fashion, converting every line to a City record, and keeping only the EU28 countries:
Path p=Paths.get("cities.tsv");
List<City>ls = Files.readAllLines(p, Charset.defaultCharset())
    .stream()
    .map( line -> City.digestLine(line))
    .filter( c -> eu28.contains(c.country) ) // only retain EU28 countries
    .collect( Collectors.toList() ) ;
System.out.println("citylist contains: " + ls.size() + " records.");
Then perform the aggregation:
// aggregate: sum population by country
Map<String, Double> countryPop=
    ls.stream().collect(
        Collectors.groupingBy( c -> c.country,
            Collectors.summingDouble( c -> c.population ) ) );
countryPop.entrySet().stream().forEach(System.out::println);
Prerequisite
To run this aardvark.code example you need to have the following software installed on your system: python with pandas, a Java 8 JDK, and aardvark.
Go aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/best_tool/aardvark.code
Execute
Execute aardvark, and get the sum of city populations per EU28 country:
citylist contains: 57033 records.
DE=8.5441224E7
FI=5179342.0
BE=1.0110726E7
PT=7090718.0
BG=5457463.0
DK=4452963.0
LT=2555924.0
LU=358224.0
LV=1720939.0
HR=3743111.0
FR=5.2697218E7
HU=1.0263483E7
SE=7802936.0
SI=1182980.0
SK=2953279.0
GB=6.3445174E7
IE=3548735.0
EE=995124.0
MT=398419.0
IT=5.2402319E7
GR=8484595.0
ES=4.9738095E7
AT=4921470.0
CY=797327.0
CZ=8717969.0
PL=2.8776423E7
RO=2.3299453E7
NL=1.501321E7
The aardvark.code
##================================================================================
##== tmp/load.py =================================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-
import pandas as pd
import csv
typenames= [ ('long' , 'geonameid'),
('String', 'name'),
('String', 'asciiname'),
('double', 'latitude'),
('double', 'longitude'),
('String', 'country'),
('double', 'population'),
('double', 'elevation') ]
colnames= map( lambda r: r[1], typenames )
df=pd.io.parsers.read_table("/u01/data/20150102_cities/cities1000.txt",
sep="\t", header=None, names= colnames,
quoting=csv.QUOTE_NONE,usecols=[ 0, 1, 2, 4, 5, 8, 14, 16],
encoding='utf-8')
## LIMIT ON SIZE
#df=df[:1000]
df.to_csv('tmp/cities.tsv', index=False, sep='\t',encoding='utf-8', header=False)
##================================================================================
##== tmp/City.java =================================================================
class City {
public long geonameid;
public String name;
public String asciiname;
public double latitude;
public double longitude;
public String country;
public double population;
public double elevation;
public City(
long geonameid
, String name
, String asciiname
, double latitude
, double longitude
, String country
, double population
, double elevation
) {
this.geonameid=geonameid;
this.name=name;
this.asciiname=asciiname;
this.latitude=latitude;
this.longitude=longitude;
this.country=country;
this.population=population;
this.elevation=elevation;
}
public static City digestLine(String s) {
String[] rec=s.split("\t");
return new City(
Long.parseLong(rec[0]), // geonameid is declared long
rec[1],
rec[2],
Double.parseDouble(rec[3]), // lat
Double.parseDouble(rec[4]), // lon
rec[5],
Double.parseDouble(rec[6]), // pop
Double.parseDouble(rec[7]) // elevation
);
}
}
##================================================================================
##== tmp/Main.java =================================================================
import java.util.List;
import java.util.Map;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Arrays;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.charset.Charset;
import java.io.IOException;
import java.util.stream.Collectors;
public class Main {
    public static void main( String args[]) throws IOException {
        HashSet<String> eu28 = new HashSet<String>( Arrays.asList(
            "AT", "BE", "BG", "CY", "CZ", "DE", "DK", "EE", "ES", "FI", "FR",
            "GB", "GR", "HR", "HU", "IE", "IT", "LT", "LU", "LV", "MT", "NL",
            "PL", "PT", "RO", "SE", "SI", "SK", "AN" ) ) ;
        Path p=Paths.get("cities.tsv");
        List<City>ls = Files.readAllLines(p, Charset.defaultCharset())
            .stream()
            .map( line -> City.digestLine(line))
            .filter( c -> eu28.contains(c.country) ) // only retain EU28 countries
            .collect( Collectors.toList() ) ;
        System.out.println("citylist contains: " + ls.size() + " records.");
        // aggregate: sum population by country
        Map<String, Double> countryPop=
            ls.stream().collect(
                Collectors.groupingBy( c -> c.country,
                    Collectors.summingDouble( c -> c.population ) ) );
        countryPop.entrySet().stream().forEach(System.out::println);
    }
}
##================================================================================
##== aardvark.sh =================================================================
#!/bin/bash
# Part 1: use python to convert a csv file to a tab-separated file
chmod +x tmp/load.py
./tmp/load.py
# Part 2: compile the java code, and run it (conditionally)
S="Main.java"
T=${S%.java}.class
E=${S%.java}
# compile: but only if the java code is younger than the class
S_AGE=`stat -c %Y "tmp/"$S`
T_AGE=`stat -c %Y "tmp"/$T`
if [ -z $T_AGE ] || [ $T_AGE -le $S_AGE ]
then
echo "## Compiling"
(cd tmp; javac $S)
fi
# check if class file was produced
if [ ! -e "tmp/"$T ]
then
echo "## '$T' doesn't exist, cannot execute it."
exit 1
fi
# execute
(cd tmp; java Main)
Docker helloworld.go
What?
Compile a Go application, build it into a container, and run it. Simple! This also shows how a configuration value ('YOURNAME') can be passed from the docker command line to the container on startup.
Prerequisite
To run this aardvark.code example you need to have the following software installed on your system: go, docker and aardvark.
Go aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/docker/aardvark.code
Execute
Output of the 'build':
--- Compiling -----------------------------------------------
--- Build container -----------------------------------------
Sending build context to Docker daemon 75.15 MB
Sending build context to Docker daemon
Step 0 : FROM debian
---> 37c816ae4431
Step 1 : COPY helloworld .
---> bd81fc9a712e
Removing intermediate container 49887fffeea8
Step 2 : RUN chmod +x ./helloworld
---> Running in 52600566c5e8
---> 166b57edb033
Removing intermediate container 52600566c5e8
Step 3 : ENTRYPOINT ./helloworld
---> Running in 3e274e23d561
---> 952283f977f7
Removing intermediate container 3e274e23d561
Successfully built 952283f977f7
Output of running the container:
--- Run container -------------------------------------------
Yo CarréConfituurke, today is Saturday!
The aardvark.code
##== tmp/helloworld.go =========================================================
package main
import (
    "fmt"
    "time"
    "os"
)

func main() {
    day := time.Now().Weekday()
    name := os.Getenv("YOURNAME")
    fmt.Printf("Yo %v, today is %v!\n", name, day)
}
##== tmp/helloworld.dockerfile =================================================
FROM debian
COPY helloworld .
ENTRYPOINT [ "./helloworld" ]
##== aardvark.sh ===============================================================
#!/bin/bash
echo "--- Compiling -----------------------------------------------"
go build tmp/helloworld.go
echo "--- Build container -----------------------------------------"
docker build -f tmp/helloworld.dockerfile -t helloworld:v1 .
echo "--- Run container -------------------------------------------"
docker run -e YOURNAME=CarréConfituurke helloworld:v1
Simple static webserver
What?
It's always handy to have the code of a simple static webserver lying about. This one is in Go, and serves some literature!
Prerequisite
To run this aardvark.code example you need to have the following software installed on your system:
- Go
- aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/staticws/aardvark.code
Execute
Output of the run:
aardvark
Extract from Project Gutenberg EBook of War and Peace, by Leo Tolstoy
Not only the generals in full parade uniforms, with their thin or
thick waists drawn in to the utmost, their red necks squeezed into
their stiff collars, and wearing scarves and all their decorations,
not only the elegant, pomaded officers, but every soldier with his
freshly washed and shaven face and his weapons clean and polished to
the utmost, and every horse groomed till its coat shone like satin
and every hair of its wetted mane lay smooth--felt that no small
matter was happening, but an important and solemn affair. Every
general and every soldier was conscious of his own insignificance,
aware of being but a drop in that ocean of men, and yet at the same
time was conscious of his strength as a part of that enormous whole.
Extract from the Project Gutenberg EBook of The Complete Works of William Shakespeare
Friends, Romans, countrymen, lend me your ears!
I come to bury Caesar, not to praise him.
The evil that men do lives after them,
The good is oft interred with their bones;
So let it be with Caesar. The noble Brutus
Hath told you Caesar was ambitious;
If it were so, it was a grievous fault,
And grievously hath Caesar answer'd it.
Here, under leave of Brutus and the rest-
For Brutus is an honorable man;
So are they all, all honorable men-
Come I to speak in Caesar's funeral.
The aardvark.code
##== tmp/staticws.go ========================================
package main
import (
"fmt"
"github.com/gorilla/mux"
"net/http"
"os"
"time"
)
func main() {
r := mux.NewRouter()
r.PathPrefix("/static/").Handler(http.StripPrefix("/static/", http.FileServer(http.Dir("./tmp"))))
srv := &http.Server{
Handler: r,
Addr: ":8642",
WriteTimeout: 15 * time.Second, // enforce timeouts for servers you create!
ReadTimeout: 15 * time.Second,
}
err:=srv.ListenAndServe()
if err!=nil {
fmt.Fprintf(os.Stderr, "Error starting server: %v\n" , err.Error())
}
}
##== tmp/war_and_peace.txt ========================================
Extract from Project Gutenberg EBook of War and Peace, by Leo Tolstoy
Not only the generals in full parade uniforms, with their thin or
thick waists drawn in to the utmost, their red necks squeezed into
their stiff collars, and wearing scarves and all their decorations,
not only the elegant, pomaded officers, but every soldier with his
freshly washed and shaven face and his weapons clean and polished to
the utmost, and every horse groomed till its coat shone like satin
and every hair of its wetted mane lay smooth--felt that no small
matter was happening, but an important and solemn affair. Every
general and every soldier was conscious of his own insignificance,
aware of being but a drop in that ocean of men, and yet at the same
time was conscious of his strength as a part of that enormous whole.
##== tmp/julius_caesar.txt ========================================
Extract from the Project Gutenberg EBook of The Complete Works of William Shakespeare
Friends, Romans, countrymen, lend me your ears!
I come to bury Caesar, not to praise him.
The evil that men do lives after them,
The good is oft interred with their bones;
So let it be with Caesar. The noble Brutus
Hath told you Caesar was ambitious;
If it were so, it was a grievous fault,
And grievously hath Caesar answer'd it.
Here, under leave of Brutus and the rest-
For Brutus is an honorable man;
So are they all, all honorable men-
Come I to speak in Caesar's funeral.
##== aardvark.sh ========================================
#!/bin/bash
go build tmp/staticws.go
./staticws &
sleep 1    # give the server a moment to start before curling it
curl http://localhost:8642/static/war_and_peace.txt
echo
curl http://localhost:8642/static/julius_caesar.txt
killall -u $USER staticws
Generate Go code using templates
What?
This Go application reads in the user-supplied definition (or structure) of a CSV file, and generates 'reader.go', a fully working Go application that reads that CSV file. It does this using Go's nifty template feature.
If you want to customize the resulting code, just change the template reader.tpl. For another CSV file, just change description.txt.
Prerequisite
To run this aardvark.code example you need to have the following software installed on your system:
- Go
- aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/codegen/aardvark.code
Execute
Stage 1
Read in a .txt file and turn it into a Go Descriptor object, which contains a Col object per column definition.
The structure as defined in the description.txt file
EntityName:City
0 Geonameid int
1 Name string
2 Asciiname string
4 Lat float64 # latitude
5 Lon float64 # longitude
8 Country string
14 Population int
16 Elevation float64
Filename:/u01/data/20150102_cities/cities1000.txt
Separator:\t
.. will look like this Go Descriptor object
{
EntityName:City
Filename:/u01/data/20150102_cities/cities1000.txt
Separator:\t
Numcols:8
Cols:[
{Position:0 Identifier:Geonameid Type:int ConversionFlag:true }
{Position:1 Identifier:Name Type:string ConversionFlag:false}
{Position:2 Identifier:Asciiname Type:string ConversionFlag:false}
{Position:4 Identifier:Lat Type:float64 ConversionFlag:true }
{Position:5 Identifier:Lon Type:float64 ConversionFlag:true }
{Position:8 Identifier:Country Type:string ConversionFlag:false}
{Position:14 Identifier:Population Type:int ConversionFlag:true }
{Position:16 Identifier:Elevation Type:float64 ConversionFlag:true }
]
}
Stage 2
Read in the template 'reader.tpl', apply the above data to it, and put the result in the file 'reader.go'.
I'm just going to lift the veil a bit by showing what the following snippet of the template does. For more information about Go's template feature, refer to the documentation: golang.org/pkg/text/template
When you apply the data to this template snippet ..
type «.EntityName» struct {
«range .Cols» «.Identifier» «.Type»
«end»}
.. this output will be produced:
type City struct {
Geonameid int
Name string
Asciiname string
Lat float64
Lon float64
Country string
Population int
Elevation float64
}
For your own enlightenment, look for other «...» expressions in the 'reader.tpl' file, and see the code it generates in 'reader.go'.
Stage 3
Compile and run the generated 'reader.go', which reads in the CSV file and shows a couple of records.
The aardvark.code
##== tmp/description.txt ========================================
EntityName:City
0 Geonameid int
1 Name string
2 Asciiname string
4 Lat float64 # latitude
5 Lon float64 # longitude
8 Country string
14 Population int
16 Elevation float64
Filename:/u01/data/20150102_cities/cities1000.txt
Separator:\t
##== tmp/reader.tpl ========================================
package main
import (
"os"
"strconv"
"bufio"
"fmt"
"io"
"strings"
)
func main() {
filename:="«.Filename»"
f,err := os.Open(filename)
defer f.Close()
if err != nil {
fmt.Fprintf(os.Stderr, "Opening file %q: %s\n", filename,err.Error())
os.Exit(1)
}
r:=bufio.NewReader(f)
repeat:=true
ignoredLines:=0
list:=make([]«.EntityName»,0,0)
for repeat {
line,overflow,err := r.ReadLine()
repeat = (err!=io.EOF) // EOF means stop repeating this loop
if err != nil && err!=io.EOF {
fmt.Fprintf(os.Stderr, "Read error: %s\n", err.Error())
break
}
if overflow {
fmt.Fprintf(os.Stderr, "Overflow error on reading!\n")
break
}
recs:=strings.Split(string(line),"«.Separator»")
if len(recs)>«.Numcols» {
row,err:=extract( strings.Split(string(line),"«.Separator»"))
if err!=nil {
// assume error already reported
break
}
list=append(list,row)
} else {
ignoredLines+=1
}
}
if ignoredLines>0 {
fmt.Fprintf(os.Stderr, "Warning: %v line(s) ignored because of too few fields.\n",ignoredLines)
}
for i,r:=range(list) {
fmt.Printf("%v\n",r)
if i>10 {
break
}
}
}
type «.EntityName» struct {
«range .Cols» «.Identifier» «.Type»
«end»}
func extract(rec []string) (record «.EntityName»,err error) {
«range .Cols»«if .ConversionFlag»«template "convert" .»«else» _«.Identifier» := rec[«.Position»]«end»
«end»
record = «.EntityName»{ «range .Cols»«.Identifier»:_«.Identifier», «end»}
return
}
«define "convert"»«if eq .Type "int"» _«.Identifier»:=0
if len(rec[«.Position»])>0 {
_«.Identifier»,err=strconv.Atoi(rec[«.Position»])
if err != nil {
fmt.Fprintf(os.Stderr, "Error converting «.Identifier»: %v\n", err.Error())
return
}
}«end»«if eq .Type "float64"» _«.Identifier»:=0.0
if len(rec[«.Position»])>0 {
_«.Identifier»,err=strconv.ParseFloat(rec[«.Position»],64)
if err != nil {
fmt.Fprintf(os.Stderr, "Error converting «.Identifier»: %v\n", err.Error())
return
}
}«end»«end»
##== tmp/grok.go ========================================
package main
import (
"fmt"
"bufio"
"os"
"regexp"
"io/ioutil"
"strings"
"strconv"
"text/template"
)
type Descriptor struct {
EntityName string
Filename string
Separator string
Numcols int
Cols []Col
}
type Col struct {
Position int
Identifier string
Type string
ConversionFlag bool
}
func main() {
d,err:=getDescriptor("tmp/description.txt") // read the description
if err != nil {
os.Exit(1)
}
fmt.Printf("%+v\n",d) // print the descriptor object (the stage 1 output)
f,err:=os.Create("tmp/reader.go") // prepare file for output
if err != nil {
fmt.Fprintf(os.Stderr, "File open error: %s\n", err.Error())
os.Exit(1)
}
defer f.Close()
w:=bufio.NewWriter(f)
t:=template.New("reader.tpl") // create template
t.Delims("«","»")
t=template.Must(t.ParseFiles("tmp/reader.tpl"))
err=t.Execute(w,d) // execute the template
if err!=nil {
fmt.Fprintf(os.Stderr, "Template execute error: %s\n", err.Error())
}
w.Flush()
}
func getDescriptor(filename string) (desc Descriptor, err error) {
desc=Descriptor{ EntityName:"x" }
content, err := ioutil.ReadFile(filename)
if err != nil {
fmt.Fprintf(os.Stderr, "File read error: %s\n", err.Error())
}
body:=strings.TrimSpace(strings.Replace( string(content), "\n","|",-1) )
// regular expressions matching 1) key:value pair 2) fields line
reKeyValue := regexp.MustCompile(`^\s*(\w+)\s*:\s*(\S+).*`)
reFields := regexp.MustCompile(`^\s*(\d+)\s*(\w+)\s*(\w+).*`)
desc.Cols = make([]Col, 0, 0)
for _,line:= range strings.Split(body,"|") {
if n:=strings.Index(line,"#"); n>-1 { // remove comments
line=line[:n]
}
line=strings.TrimSpace(line) // empty string?
if len(line)<=1 {
continue
}
group:=reKeyValue.FindStringSubmatch(line) // pattern: key:value
if group!=nil {
digestKeyValue(&desc, group)
continue
}
group=reFields.FindStringSubmatch(line) // pattern: num word word
if group!=nil {
err=digestFields(&desc, group)
if err!=nil {
break
}
}
}
desc.Numcols=len(desc.Cols)
return
}
func digestKeyValue(desc *Descriptor, group []string) {
k:= group[1]
v:= group[2]
if (k=="EntityName") {
desc.EntityName=v
} else if (k=="Filename") {
desc.Filename=v
} else if (k=="Separator") {
desc.Separator=v
} else {
fmt.Fprintf(os.Stderr, "WARNING: Key:Value pair %v:%v ignored\n", k,v)
}
}
func digestFields(desc *Descriptor, group []string) (err error) {
p,err:=strconv.Atoi(group[1])
if err != nil {
fmt.Fprintf(os.Stderr, "Conversion error: %s\n", err.Error())
return
}
id:=group[2]
desc.Cols=append(desc.Cols, Col{ Position:p,
Identifier:id,
Type: group[3],
ConversionFlag: group[3]!="string" })
return
}
##== aardvark.sh ========================================
#!/bin/bash
rm -f grok tmp/reader.go reader # cleanup
go build tmp/grok.go # build the code-generator
if [ -x ./grok ]
then
./grok
fi
if [ -f ./tmp/reader.go ]
then
go build tmp/reader.go # build the generated go code
./reader # and execute
fi
Aardvark in append mode
What?
Normally, when a filename occurs more than once in an aardvark.code file, the file gets overwritten. But if you precede the filename with a plus sign, aardvark will append to the existing file instead of overwriting it.
Prerequisite
To run this aardvark.code example you need to have the following software installed on your system:
- Go
- aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/append/aardvark.code
Look at the file
Spot the '+' signs in front of the filenames (e.g. +spanish.txt).
##== spanish.txt =========================================================
El lago es el segundo lago más grande de América del Sur,
localizado en la Patagonia y compartido por Chile y Argentina.
##== english.txt =========================================================
The lake is the second biggest lake of South America, located
in Pataganio and shared by Chile and Argentina.
##== nederlands.txt ======================================================
Het meer is het tweede grootste meer van Zuid Amerika, gelocaliseerd
in Patagonie and gedeeld door Chili en Argentinie.
##== +spanish.txt ========================================================
A cada lado de la frontera tiene nombres diferentes: en Chile es
conocido como lago General Carrera, mientras que
en Argentina se le denomina lago Buenos Aires.
##== +english.txt ========================================================
At both sides of the border it has a different name: in Chili it is
known as Lake General Carrera, while in Argentina it is named
Lake Buenos Aires.
##== +nederlands.txt =====================================================
Aan beide zijden van de grense heeft het een verschillende naam: in Chile
het is bekend als meer General Carrera, terwijl in Argentinie het
meer Buenos Aires benoemd werd.
Run aardvark
As you might have guessed, when you run aardvark on this file, you get three output files, one per language.
Output: spanish.txt
El lago es el segundo lago más grande de América del Sur,
localizado en la Patagonia y compartido por Chile y Argentina.
A cada lado de la frontera tiene nombres diferentes: en Chile es
conocido como lago General Carrera, mientras que
en Argentina se le denomina lago Buenos Aires.
Output: english.txt
The lake is the second biggest lake of South America, located
in Pataganio and shared by Chile and Argentina.
At both sides of the border it has a different name: in Chili it is
known as Lake General Carrera, while in Argentina it is named
Lake Buenos Aires.
Output: nederlands.txt
Het meer is het tweede grootste meer van Zuid Amerika, gelocaliseerd
in Patagonie and gedeeld door Chili en Argentinie.
Aan beide zijden van de grense heeft het een verschillende naam: in Chile
het is bekend als meer General Carrera, terwijl in Argentinie het
meer Buenos Aires benoemd werd.