TL;DR
Aardvark is about putting a bunch of code files together in one file, and executing one command to do everything necessary (compile, execute, launch on a cluster, ship across the network, ...) to produce the desired output.
Stop being a manager of files; concentrate on writing code!
Detail
The 'itch' that led me to the 'scratch' was the number of different development environments I had to open up to run relatively simple code on a Spark cluster, use Python to copy the results from HDFS, and then use R to plot a nice chart from it. Noticing that switching between contexts was not helping my concentration or focus, I decided to put all the code together in one big code file, and use a utility (i.e. aardvark) to split it up into smaller files.
So I 'cat'-ted all my source files into one file named aardvark.code, each entry separated by a line containing
##== <filename>
and ran aardvark on it.
NOTE: on Wed 1st Jun, the split pattern was changed from ##-- to ##== !
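The splitting mechanism itself is easy to picture. Here is a hypothetical Python sketch of the idea (the real tool is written in Go; this illustration also skips the special treatment of '$'-prefixed names described in the sqlite example further down):

```python
import re

def split_aardvark(path):
    """Split a combined code file on '##== <filename>' marker lines,
    writing each chunk to the file named on its marker."""
    current, chunks = None, {}
    with open(path) as f:
        for line in f:
            if line.startswith("##=="):
                m = re.match(r"##==\s+(\S+)", line)
                if m:  # a marker naming a file: start a new chunk
                    current = m.group(1)
                    chunks[current] = []
                # bare '##====...' ruler lines are simply skipped
            elif current is not None:
                chunks[current].append(line)
    for name, lines in chunks.items():
        with open(name, "w") as out:
            out.writelines(lines)
    return sorted(chunks)
```

Run on the hello-world example below, such a splitter would write out generate_data.R, capitalize.py and aardvark.sh.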
Now the number of windows opened in my development environment was reduced to just two terminal windows: one running vi to edit the code, and the other one to kick off the 'aardvark' command.
In this one file, aardvark.code, you can also add your data, documentation, etc. It's like taking the object-oriented concept of encapsulation a step further: encapsulate not only the code, but also the data, documentation, and compilation/execution/test instructions in one single file.
Another idea is to put what is important, or what changes a lot, close to the top, like a 'TL;DR' section or an executive summary.
Advantages
- less file management chores to do, concentrate on writing code
- less context switching (from Eclipse to RStudio to IDLE to the command line to Firefox to ...) is good for the brain (google 'nytimes unitasking distraction': doing less but getting more done)
- use the best bits of each world: a db for data manipulation, Python as the Swiss-army knife, R for plotting charts, markdown for documentation, csv for data, golang for speed, ..., and glue it all together using aardvark
- reduce your mouse usage, keep those fingers on the keyboard, and concentrate on the code
- no need to wonder what script needs to be executed for kicking off the code in that directory: by convention it is always aardvark.sh
- easy to work on a new (or refactored) version of a project while keeping the old one ready for execution and testing (just have 2 aardvark files, e.g. aardvark.code and old.code)
Writing code the 'aardvark' style is ideal for a collection of relatively short scripts that have disparate execution environments and can all be executed from the command line.
Install aardvark
Go get style
With your $GOPATH variable properly set:
$ go get github.com/dtmngngnnj/aardvark
$ cp $GOPATH/bin/aardvark ~/bin
Manually
Grab a copy of aardvark.go from github, compile it, and copy the resulting executable to your bin directory:
$ wget https://github.com/dtmngngnnj/aardvark/raw/master/aardvark.go
$ go build aardvark.go
$ cp aardvark ~/bin
20160515
Simple example: Hello World
What?
In the following example aardvark extracts 3 files from the aardvark.code file: an R script, a Python script and a bash script. Then aardvark.sh (the bash script) is executed. The R script emits 'hello world', which gets capitalized by the Python script.
Prerequisite
For this example you need to have the following software installed on your computer: R, Python 2 and aardvark.
In a newly created directory (aka folder), put the aardvark.code file, which you can grab like this:
wget http://data.munging.ninja/aardvarkcode/simple_example/aardvark.code
Here's the content:
##================================================================================
##== generate_data.R
cat ("hello world!")
##================================================================================
##== capitalize.py
#!/usr/bin/python
import sys
import string
for line in sys.stdin:
    print string.capwords(line)
##================================================================================
##== aardvark.sh
#!/bin/bash
R_EXE="/usr/bin/R --slave --vanilla --quiet"
PY_EXE="/usr/bin/python2"
$R_EXE -f ./generate_data.R | $PY_EXE ./capitalize.py
Execute aardvark. After some housekeeping messages, you'll see:
$ aardvark
..
..
Hello World!
Sqlite example
What?
In the following example aardvark stores one entry's content in the tag dictionary, and extracts 3 files from the aardvark.code file:
- the create_load_exec.sql script
- the aardvark.sh script (which gets auto-executed when aardvark finishes writing the files)
- the data.csv file
The executed script pumps data into an SQLite database and runs a SQL query on it.
Prerequisite
For this example you need to have the following software installed on your computer: sqlite3 and aardvark.
Go aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/sqlite/aardvark.code
The $key and value
Look at the code: when aardvark finds a 'filename' that starts with a dollar sign '$' (e.g. $sql), it is not considered a file but a key/value pair. The value (content) is stored in a dictionary under the key (e.g. '$sql'). Further down in the aardvark.code file, this content is pulled into a script via the identifier '[[$sql]]'.
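The substitution step can be sketched in a few lines of Python; this is a hypothetical illustration of the idea, not the actual Go implementation:

```python
import re

def expand_tags(text, tags):
    """Replace every [[$key]] placeholder in 'text' with the content
    stored under that key in the tag dictionary."""
    return re.sub(r"\[\[(\$\w+)\]\]", lambda m: tags[m.group(1)], text)
```

With tags = {'$sql': 'select ...'} this would pull the query into the create_load_exec.sql template exactly where [[$sql]] appears.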
Execute
Execute aardvark. After some housekeeping messages, you'll see the result from the query:
$ aardvark
..
..
TNM SCRM Magallanes & Antártica
SMB SCSB Magallanes & Antártica
WPR SCFM Magallanes & Antártica
PNT SCNT Magallanes & Antártica
WPU SCGZ Magallanes & Antártica
PUQ SCCI Magallanes & Antártica
The aardvark.code file
##== $sql =====================================================================
select iata,airport_name,region
from t_airport
where region like '%ca' ;
##=============================================================================
##== create_load_exec.sql -----------------------------------------------------
create table t_airport (
iata varchar(8)
,icao varchar(8)
,city_served varchar(64)
,region varchar(64)
,airport_name varchar(128)
);
.mode csv
.import data.csv t_airport
.mode tabs
[[$sql]]
##=============================================================================
##== aardvark.sh --------------------------------------------------------------
#!/bin/bash
rm -f test.db
cat create_load_exec.sql | sqlite3 test.db
##=============================================================================
##== data.csv -----------------------------------------------------------------
WAP,SCAP,Alto Palena,Los Lagos,SCAP
ZUD,SCAC,Ancud,Los Lagos,SCAC
TNM,SCRM,Antarctica,Magallanes & Antártica,SCRM
ANF,SCFA,Antofagasta,Antofagasta,SCFA
ARI,SCAR,Arica,Arica & Parinacota,SCAR
BBA,SCBA,Balmaceda,Aisén,SCBA
CJC,SCCF,Calama,Antofagasta,SCCF
WCA,SCST,Castro,Los Lagos,SCST
SMB,SCSB,Cerro Sombrero,Magallanes & Antártica,SCSB
WCH,SCTN,Chaitén,Los Lagos,SCTN
CNR,SCRA,Chañaral,Atacama,SCRA
CCH,SCCC,Chile Chico,Aisén,SCCC
YAI,SCCH,Chillán,Biobío,SCCH
GXQ,SCCY,Coihaique,Aisén,SCCY
LGR,SCHR,Cochrane,Aisén,SCHR
CCP,SCIE,Concepción,Biobío,SCIE
CPO,SCHA,Copiapó,Atacama,SCHA
COW,SCQB,Coquimbo,Coquimbo,SCQB
ZCQ,SCIC,Curicó,Maule,SCIC
ESR,SCES,El Salvador,Atacama,SCES
FFU,SCFT,Futaleufú,Los Lagos,SCFT
IQQ,SCDA,Iquique,Tarapacá,SCDA
IPC,SCIP,Isla de Pascua,Valparaíso,SCIP
LSC,SCSE,La Serena,Coquimbo,SCSE
ZLR,SCLN,Linares,Maule,SCLN
LOB,SCAN,Los Andes,Valparaíso,SCAN
LSQ,SCAG,Los Ángeles,Biobío,SCAG
ZOS,SCJO,Osorno,Los Lagos,SCJO
OVL,SCOV,Ovalle,Coquimbo,SCOV
WPR,SCFM,Porvenir,Magallanes & Antártica,SCFM
ZPC,SCPC,Pucón,Araucanía,SCPC
WPA,SCAS,Puerto Aisén,Aisén,SCAS
PMC,SCTE,Puerto Montt,Los Lagos,SCTE
PNT,SCNT,Puerto Natales,Magallanes & Antártica,SCNT
WPU,SCGZ,Puerto Williams,Magallanes & Antártica,SCGZ
PUQ,SCCI,Punta Arenas,Magallanes & Antártica,SCCI
QRC,SCRG,Rancagua,O'Higgins,SCRG
SSD,SCSF,San Felipe,Valparaíso,SCSF
SCL,SCEL,Santiago,Santiago Metropolitan,SCEL
ULC,SCTI,Santiago,Santiago Metropolitan,SCTI
TLX,SCTL,Talca,Maule,SCTL
ZCO,SCTC,Temuco,Araucanía,SCTC
TOQ,SCBE,Tocopilla,Antofagasta,SCBE
ZAL,SCVD,Valdivia,Los Ríos,SCVD
VLR,SCLL,Vallenar,Atacama,SCLL
VAP,SCVA,Valparaíso,Valparaíso,SCVA
KNA,SCVM,Viña del Mar - Concón,Valparaíso,SCVM
Postgresql / Java / R example
What?
Java is used to run a SQL query on a Postgres database. The result is loaded into a data.frame in R, and a pie chart is plotted. Note: the figures are sums of city populations, not whole-country populations, so be careful before you broadcast these 'facts'!
Prerequisite
For this example you need to have the following software installed on your computer:
- a postgres db running with the city data inserted into t_city, as described on page Load City Data (click on the 'sql' tab)
- the postgres jdbc jar file
- java jdk
- R
- aardvark
Go aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/sqljavar/aardvark.code
Look at the following code: the lines of SQL code will get inserted into the Java code (via the tag dictionary). The Java program prints out the resultset, which is piped into a file; R reads that file and produces a colorful pie chart from it.
Execute
Execute aardvark. Have a look at the produced pie chart:
$ display pie.png
The aardvark.code
##================================================================================
##== $sqlquery_java =============================================================
" select country, sum(population) as sum_pop "+
" from t_city "+
" where country in ('AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR',"+
" 'GB', 'GR', 'HR', 'HU', 'IE', 'IT', 'LT', 'LU', 'LV', 'MT', 'NL',"+
" 'PL', 'PT', 'RO', 'SE', 'SI', 'SK', 'AN')"+
" group by country"+
" order by 2 desc;"
##================================================================================
##== $print_result_java =========================================================
System.out.printf("%s\t%d\n", rs.getString(1), rs.getInt(2));
##================================================================================
##== plot.R ======================================================================
topn=7 # experiment: eg. change to top-10 or top-5
df<-read.table('result.csv',sep="\t",header=F)
colnames(df)=c("country", "population")
pd=rbind( df[1:topn,], data.frame(country="Rest",
population=sum(df[(topn+1):nrow(df),"population"])) )
pct <- round(pd$population/sum(pd$population)*100)
pd$label=paste(pd$country," (",pct,"%)",sep="")
#x11(width=800, height=300)
png('pie.png',width=800, height=400)
par(mfrow = c(1, 2))
pie(pd$population,labels=pd$label,main="EU population before Brexit",
col=rainbow(nrow(pd)))
# drop GB
df<-df[ df$country!='GB',]
pd=rbind( df[1:topn,], data.frame(country="Rest",
population=sum(df[(topn+1):nrow(df),"population"])) )
pct <- round(pd$population/sum(pd$population)*100)
pd$label=paste(pd$country," (",pct,"%)",sep="")
pie(pd$population,labels=pd$label,main="EU population after Brexit",
col=rainbow(nrow(pd)))
dev.off()
##================================================================================
##== dirty/query/Query.java ======================================================
package query;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class Query {
    public static void main( String args[]) {
        Connection con = null;
        Statement stmt = null;
        try {
            Class.forName("org.postgresql.Driver");
            con = DriverManager.getConnection("jdbc:postgresql://172.16.1.43:5432/dmn",
                    "dmn", "dmn");
            con.setAutoCommit(false);
            stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery(
                [[$sqlquery_java]]
            );
            while ( rs.next() ) {
                [[$print_result_java]]
            }
            rs.close();
            stmt.close();
            con.close();
        } catch ( Exception e ) {
            System.err.println( e.getClass().getName()+": "+ e.getMessage() );
            System.exit(0);
        }
    }
}
##================================================================================
##== aardvark.sh =================================================================
#!/bin/bash
export DB_RESULT="result.csv"
# --------------------------------------------------------------
# Part 1: compile the java file, and run it (conditionally)
export POSTGRESJDBC="/opt/jdbc/postgres/postgresql-9.4.1208.jar"
S="query/Query.java"
T=${S%.java}.class
E=${S%.java}
# compile: but only if the java code is younger than the class
S_AGE=`stat -c %Y "dirty/"$S`
T_AGE=`stat -c %Y "dirty"/$T`
if [ -z $T_AGE ] || [ $T_AGE -le $S_AGE ]
then
echo "## Compiling"
(cd dirty; javac $S)
fi
# check if class file was produced
if [ ! -e "dirty/"$T ]
then
echo "## '$T' doesn't exist, can't run it."
exit 1
fi
# execute
echo "Fetching data from DB"
java -cp $POSTGRESJDBC:dirty $E $* > $DB_RESULT
# --------------------------------------------------------------
# Part 2: kick off R
echo "Plotting"
R_EXE="/usr/bin/R --slave --vanilla --quiet"
$R_EXE -f ./plot.R
Spark example
What?
This example is about running a Spark Scala job on the cluster.
The same NYC taxi data is used as was described in the article 'All Cylinders', but now to calculate the average tip per ride per weekday. Also see that article for more information about the 'build.sbt' and 'assembly.sbt' files.
The final barchart produced looks like this:
Prerequisite
The Scala build tool ('sbt') has been installed on your system, as well as Spark, a Hadoop client (for accessing HDFS), and finally R for plotting. And of course aardvark.
Go aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/spark/aardvark.code
The two most important files are on top:
- the scala query
- the plotting of the chart using R
Execute
Execute aardvark. Go for coffee. Come back. Have a look at the chart:
$ display barchart.png
The aardvark.code
##========================================================================
##== $query_scala
// input is 'in_rdd', output is 'out_rdd'
// for every line in the file (except the header), split it into fields,
// and 'emit' a tuple containing
// key: day-of-week, (prepended with number for sorting eg. "3-WED")
// value: (1, tip_amount)
val ride=in_rdd.filter( !_.startsWith("VendorID") ).
map( line => {
val spl=line.split(",")
val dateFmt= DateTimeFormatter.ofPattern("yyyy-MM-dd")
val dt=LocalDate.parse( spl(1).substring(0,10), dateFmt)
val dows=dt.getDayOfWeek().toString().substring(0,3)
val down=dt.getDayOfWeek().getValue()
( s"$down-$dows", (1, spl(15).toDouble) )
})
// sum up, per day-of-week
val tuple=ride.reduceByKey( (a,b) => (a._1+b._1, a._2+b._2))
// output: divide tips by num-rides, to get average
val out_rdd=tuple.map( r => {
val (k,v)=(r._1,r._2)
if (v._1!=0) (k, v._2/v._1.toDouble)
else (k, 0)
} )
##========================================================================
##== plot.R
png('barchart.png',width=800, height=400)
df<-read.table('output.txt', sep=',', header=F)
names(df)<-c("dow","val")
dfo=df[order(df$dow),]
dfo$dow=sub('^..','',dfo$dow)
barplot( dfo$val, names.arg=dfo$dow,
main="Average tip per ride",sub="2015" )
dev.off()
##========================================================================
##== Taxi.scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.log4j.Logger
import java.time.LocalDate
import java.time.format.DateTimeFormatter
object Taxi {
def main(arg: Array[String]) {
var logger = Logger.getLogger(this.getClass())
// Arguments
if (arg.length < 2) {
logger.error("No input/output path!")
System.err.println("No input/output path!")
System.exit(1)
}
val inpath = arg(0)
val outpath = arg(1)
// setup sparkcontext
val jobname = "Taxi"
val conf = new SparkConf().setAppName(jobname)
val sc = new SparkContext(conf)
logger.info(s"Job=${jobname} Inpath=${inpath} Outpath=${outpath} " )
val in_rdd=sc.textFile(inpath) // the taxi file
[[$query_scala]]
out_rdd.saveAsTextFile(outpath)
}
}
##========================================================================
##== build.sbt
mainClass in assembly := Some("Taxi")
jarName in assembly := "taxi.jar"
lazy val root = (project in file(".")).
settings(
name := "taxi",
version := "1.0"
)
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
)
##========================================================================
##== project/assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
##========================================================================
##== aardvark.sh
#!/bin/bash
# *********************************************************
# *** PART 0: checks before running ***********************
if [ -z $HADOOP_HOME ]; then
echo "Variable 'HADOOP_HOME' is not set!"
exit 1
fi
if [ -z $SPARK_HOME ]; then
echo "Variable 'SPARK_HOME' is not set!"
exit 1
fi
# *********************************************************
# *** PART 1: assemble the jar file ***********************
# compare age of source (scala file) and target (jar file)
S_DATE=`stat -c %Y Taxi.scala`
T_DATE=0
JARFILE=`ls target/scala*/taxi.jar`
if [ ! -z $JARFILE ]
then
T_DATE=`stat -c %Y $JARFILE`
fi
if [ $T_DATE -le $S_DATE ]
then
echo "*** sbt assembly ***"
echo "(if this is the first run, go for a coffee break)"
sbt assembly
fi
# *********************************************************
# *** PART 2: launch jar on the spark cluster *************
# condition 1: the jarfile should exist
JARFILE=`ls target/scala*/taxi.jar`
if [ ! -f $JARFILE ]
then
echo "'$JARFILE' doesn't exist, can't run it."
exit 1
fi
# condition 2: the jar file should be younger than
# the scala sourcefile
S_DATE=`stat -c %Y Taxi.scala`
T_DATE=`stat -c %Y $JARFILE`
if [ $T_DATE -le $S_DATE ]
then
echo "'$JARFILE' is older than source, not running"
exit 1
fi
# define job input/output paths
OUTPUT_PATH=hdfs:///user/wildadm/tip_per_ride
INPUT_PATH=hdfs:///user/wildadm/20160421_nyc_taxi
#INPUT_PATH=hdfs:///user/wildadm/20160421_nyc_taxi_subset
# PRE-LAUNCH: delete the output directory
$HADOOP_HOME/bin/hdfs dfs -rm -r tip_per_ride
# LAUNCH
$SPARK_HOME/bin/spark-submit --master yarn-cluster \
--num-executors 12 \
target/scala-2.10/taxi.jar \
$INPUT_PATH $OUTPUT_PATH
# *********************************************************
# *** PART 3: post-run, fetch data from hdfs **************
$HADOOP_HOME/bin/hdfs dfs -cat $OUTPUT_PATH/part* |\
sed -e 's/^(//' -e 's/)$//' > output.txt
# *********************************************************
# *** PART 4: plot the output *****************************
/usr/bin/R --slave --vanilla --quiet -f ./plot.R
# *********************************************************
# *** THE END *********************************************
echo "Done!"
Timing: compare execution of two similar processes
What?
Wise.io have created paratext to bump up the speed of CSV parsing. See this article or github. Here is a comparison of paratext and pandas loading a big CSV file: there definitely is a difference in performance, though not a dramatic one. See the following chart: paratext in blue, pandas in red. The Y-axis is the time taken (lower is better); the X-axis is the number of lines read, from 1000 to 30 million.
This test was executed on a Xeon CPU E5-2660 @ 2.20GHz (8 cores) with 61 GB of memory available. The data file loaded was the 'train.csv' of Kaggle's Expedia Hotel competition.
Prerequisite
To run this aardvark.code example you need to have the following software installed on your system:
- python and pandas
- paratext (see above github link on how to install)
- R (for plotting)
- aardvark.
Go aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/timing/aardvark.code
Execute
Execute aardvark. Wait a while. Then admire the chart:
$ display chart.png
The aardvark.code
##================================================================================
##== pandas_load.py
import pandas as pd
df=pd.io.parsers.read_table("sample.csv",sep=',')
##================================================================================
##== para_load.py
import pandas as pd
import paratext
df = paratext.load_csv_to_pandas('sample.csv')
##================================================================================
##== plot.R
png('chart.png',width=800, height=400)
df<-read.table('timing.csv', sep='|', header=F)
x=df[df$V1=='pandas_load.py',c('V2')]
y1=df[df$V1=='pandas_load.py',c('V3')]
y2=df[df$V1=='para_load.py',c('V3')]
plot(x,y1,type='b',pch=19,col='red', main="Load CSV: Pandas vs Paratext", xlab="numlines", ylab="time")
lines(x,y2,type='b',pch=19,col='blue')
dev.off()
##================================================================================
##== aardvark.sh
#!/bin/bash
PY_EXE="/usr/bin/python2"
rm -f timing.csv
for N in 1000 10000 25000 50000 75000 100000 250000 500000 750000 1000000 2500000 \
5000000 7500000 10000000 15000000 20000000 25000000 30000000
do
head -$N train.csv > sample.csv
for PYSCRIPT in pandas_load.py para_load.py
do
/usr/bin/time -f "$PYSCRIPT|$N|%e|%U|%S" $PY_EXE $PYSCRIPT 2>> timing.csv
done
done
# plot the result
/usr/bin/R --slave --vanilla --quiet -f ./plot.R
Closing
The timing.csv data file produced by the script, and consumed by the plot.R code, looks like this:
pandas_load.py|1000|0.38|0.24|0.12
para_load.py|1000|0.39|0.28|0.10
pandas_load.py|10000|0.41|0.26|0.14
para_load.py|10000|0.40|0.30|0.12
pandas_load.py|25000|0.66|0.43|0.22
para_load.py|25000|0.49|0.41|0.13
pandas_load.py|50000|0.53|0.39|0.12
para_load.py|50000|0.47|0.43|0.15
pandas_load.py|75000|0.59|0.40|0.18
para_load.py|75000|0.52|0.55|0.13
pandas_load.py|100000|0.67|0.52|0.14
..
..
Frequently asked questions
Why name it 'aardvark'?
The first proper word on page one of the dictionary comes first! That's one aspect of what aardvark is about: putting first what's most important. Also: it will show up as one of the first files when doing 'ls -l' ...
What is a filename that starts with a dollar sign?
It's a key: its content will not be written to a file, but stored in the tag dictionary. See the sqlite example.
What is this [[$sql]] notation?
In your code you can pull in the text of previously stored tags. See the previous question and also the sqlite example.
What is the difference between aardvark, aardvark.code and aardvark.sh?
- aardvark is the utility that does the splitting
- aardvark.code is your file containing R, python, java, scala, ... code
- aardvark.sh is the script that is executed by aardvark after splitting the aardvark.code file. It's best to include aardvark.sh as an entry in your aardvark.code file.
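As a minimal (hypothetical) illustration of how the three relate, an aardvark.code could look like this:

```
##== hello.py ================================================================
print("hello from aardvark")
##== aardvark.sh =============================================================
#!/bin/bash
python hello.py
```

Running the aardvark utility in that directory writes hello.py and aardvark.sh, then executes aardvark.sh.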
Best tool for the job
What?
Instead of using a Java CSV library, use Python pandas to preprocess a complex CSV file (i.e. one with embedded commas), write it out as tab-separated fields, and drop some columns while we are at it.
Then use Java to read the easily splittable TSV file, and perform aggregation on it using Java 8 streams.
Detail about the java8 Aggregation
Read the data in streaming fashion, converting every line to a City record, and keeping only the EU28 countries:
Path p=Paths.get("cities.tsv");
List<City>ls = Files.readAllLines(p, Charset.defaultCharset())
    .stream()
    .map( line -> City.digestLine(line))
    .filter( c -> eu28.contains(c.country) ) // only retain EU28 countries
    .collect( Collectors.toList() ) ;
System.out.println("citylist contains: " + ls.size() + " records.");
Then perform the aggregation:
// aggregate: sum population by country
Map<String, Double> countryPop=
    ls.stream().collect(
        Collectors.groupingBy( c -> c.country,
            Collectors.summingDouble( c -> c.population ) ) );
countryPop.entrySet().stream().forEach(System.out::println);
Prerequisite
To run this aardvark.code example you need to have the following software installed on your system: python with pandas, a Java 8 JDK, and aardvark.
Go aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/best_tool/aardvark.code
Execute
Execute aardvark, and get the sum of city populations per EU28 country:
citylist contains: 57033 records.
DE=8.5441224E7
FI=5179342.0
BE=1.0110726E7
PT=7090718.0
BG=5457463.0
DK=4452963.0
LT=2555924.0
LU=358224.0
LV=1720939.0
HR=3743111.0
FR=5.2697218E7
HU=1.0263483E7
SE=7802936.0
SI=1182980.0
SK=2953279.0
GB=6.3445174E7
IE=3548735.0
EE=995124.0
MT=398419.0
IT=5.2402319E7
GR=8484595.0
ES=4.9738095E7
AT=4921470.0
CY=797327.0
CZ=8717969.0
PL=2.8776423E7
RO=2.3299453E7
NL=1.501321E7
The aardvark.code
##================================================================================
##== tmp/load.py =================================================================
#!/usr/bin/python
# -*- coding: utf-8 -*-
import pandas as pd
import csv
typenames= [ ('long' , 'geonameid'),
('String', 'name'),
('String', 'asciiname'),
('double', 'latitude'),
('double', 'longitude'),
('String', 'country'),
('double', 'population'),
('double', 'elevation') ]
colnames= map( lambda r: r[1], typenames )
df=pd.io.parsers.read_table("/u01/data/20150102_cities/cities1000.txt",
sep="\t", header=None, names= colnames,
quoting=csv.QUOTE_NONE,usecols=[ 0, 1, 2, 4, 5, 8, 14, 16],
encoding='utf-8')
## LIMIT ON SIZE
#df=df[:1000]
df.to_csv('tmp/cities.tsv', index=False, sep='\t',encoding='utf-8', header=False)
##================================================================================
##== tmp/City.java =================================================================
class City {
public long geonameid;
public String name;
public String asciiname;
public double latitude;
public double longitude;
public String country;
public double population;
public double elevation;
public City(
long geonameid
, String name
, String asciiname
, double latitude
, double longitude
, String country
, double population
, double elevation
) {
this.geonameid=geonameid;
this.name=name;
this.asciiname=asciiname;
this.latitude=latitude;
this.longitude=longitude;
this.country=country;
this.population=population;
this.elevation=elevation;
}
public static City digestLine(String s) {
String[] rec=s.split("\t");
return new City(
Long.parseLong(rec[0]), // geonameid is declared long
rec[1],
rec[2],
Double.parseDouble(rec[3]), // lat
Double.parseDouble(rec[4]), // lon
rec[5],
Double.parseDouble(rec[6]), // pop
Double.parseDouble(rec[7]) // elevation
);
}
}
##================================================================================
##== tmp/Main.java =================================================================
import java.util.List;
import java.util.Map;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Arrays;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.charset.Charset;
import java.io.IOException;
import java.util.stream.Collectors;
public class Main {
    public static void main( String args[]) throws IOException {
        HashSet<String> eu28 = new HashSet<String>( Arrays.asList(
            "AT", "BE", "BG", "CY", "CZ", "DE", "DK", "EE", "ES", "FI", "FR",
            "GB", "GR", "HR", "HU", "IE", "IT", "LT", "LU", "LV", "MT", "NL",
            "PL", "PT", "RO", "SE", "SI", "SK", "AN" ) ) ;
        Path p=Paths.get("cities.tsv");
        List<City>ls = Files.readAllLines(p, Charset.defaultCharset())
            .stream()
            .map( line -> City.digestLine(line))
            .filter( c -> eu28.contains(c.country) ) // only retain EU28 countries
            .collect( Collectors.toList() ) ;
        System.out.println("citylist contains: " + ls.size() + " records.");
        // aggregate: sum population by country
        Map<String, Double> countryPop=
            ls.stream().collect(
                Collectors.groupingBy( c -> c.country,
                    Collectors.summingDouble( c -> c.population ) ) );
        countryPop.entrySet().stream().forEach(System.out::println);
    }
}
##================================================================================
##== aardvark.sh =================================================================
#!/bin/bash
# Part 1: use python to convert a csv file to a tab-separated file
chmod +x tmp/load.py
./tmp/load.py
# Part 2: compile the java code, and run it (conditionally)
S="Main.java"
T=${S%.java}.class
E=${S%.java}
# compile: but only if the java code is younger than the class
S_AGE=`stat -c %Y "tmp/"$S`
T_AGE=`stat -c %Y "tmp"/$T`
if [ -z $T_AGE ] || [ $T_AGE -le $S_AGE ]
then
echo "## Compiling"
(cd tmp; javac $S)
fi
# check if class file was produced
if [ ! -e "tmp/"$T ]
then
echo "## '$T' doesn't exist, cannot execute it."
exit 1
fi
# execute
(cd tmp; java Main)
Docker helloworld.go
What?
Compile a Go application, build it into a container, and run it. Simple! This also shows how a configuration value ('YOURNAME') can be passed from the docker command line to the container on startup.
Prerequisite
To run this aardvark.code example you need to have the following software installed on your system: go, docker and aardvark.
Go aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/docker/aardvark.code
Execute
Output of the 'build':
--- Compiling -----------------------------------------------
--- Build container -----------------------------------------
Sending build context to Docker daemon 75.15 MB
Sending build context to Docker daemon
Step 0 : FROM debian
---> 37c816ae4431
Step 1 : COPY helloworld .
---> bd81fc9a712e
Removing intermediate container 49887fffeea8
Step 2 : RUN chmod +x ./helloworld
---> Running in 52600566c5e8
---> 166b57edb033
Removing intermediate container 52600566c5e8
Step 3 : ENTRYPOINT ./helloworld
---> Running in 3e274e23d561
---> 952283f977f7
Removing intermediate container 3e274e23d561
Successfully built 952283f977f7
Output of running the container:
--- Run container -------------------------------------------
Yo CarréConfituurke, today is Saturday!
The aardvark.code
##== tmp/helloworld.go =========================================================
package main
import (
    "fmt"
    "time"
    "os"
)

func main() {
    day := time.Now().Weekday()
    name := os.Getenv("YOURNAME")
    fmt.Printf("Yo %v, today is %v!\n", name, day)
}
##== tmp/helloworld.dockerfile =================================================
FROM debian
COPY helloworld .
ENTRYPOINT [ "./helloworld" ]
##== aardvark.sh ===============================================================
#!/bin/bash
echo "--- Compiling -----------------------------------------------"
go build tmp/helloworld.go
echo "--- Build container -----------------------------------------"
docker build -f tmp/helloworld.dockerfile -t helloworld:v1 .
echo "--- Run container -------------------------------------------"
docker run -e YOURNAME=CarréConfituurke helloworld:v1
Simple static webserver
What?
It's always handy to have the code of a simple static webserver lying about. This one is in Go, and serves some literature!
Prerequisite
To run this aardvark.code example you need to have the following software installed on your system:
- Go
- aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/staticws/aardvark.code
Execute
Output of the run:
aardvark
Extract from Project Gutenberg EBook of War and Peace, by Leo Tolstoy
Not only the generals in full parade uniforms, with their thin or
thick waists drawn in to the utmost, their red necks squeezed into
their stiff collars, and wearing scarves and all their decorations,
not only the elegant, pomaded officers, but every soldier with his
freshly washed and shaven face and his weapons clean and polished to
the utmost, and every horse groomed till its coat shone like satin
and every hair of its wetted mane lay smooth--felt that no small
matter was happening, but an important and solemn affair. Every
general and every soldier was conscious of his own insignificance,
aware of being but a drop in that ocean of men, and yet at the same
time was conscious of his strength as a part of that enormous whole.
Extract from the Project Gutenberg EBook of The Complete Works of William Shakespeare
Friends, Romans, countrymen, lend me your ears!
I come to bury Caesar, not to praise him.
The evil that men do lives after them,
The good is oft interred with their bones;
So let it be with Caesar. The noble Brutus
Hath told you Caesar was ambitious;
If it were so, it was a grievous fault,
And grievously hath Caesar answer'd it.
Here, under leave of Brutus and the rest-
For Brutus is an honorable man;
So are they all, all honorable men-
Come I to speak in Caesar's funeral.
The aardvark.code
##== tmp/staticws.go ========================================
package main
import (
"fmt"
"github.com/gorilla/mux"
"net/http"
"os"
"time"
)
func main() {
r := mux.NewRouter()
r.PathPrefix("/static/").Handler(http.StripPrefix("/static/", http.FileServer(http.Dir("./tmp"))))
srv := &http.Server{
Handler: r,
Addr: ":8642",
WriteTimeout: 15 * time.Second, // enforce timeouts for servers you create!
ReadTimeout: 15 * time.Second,
}
err:=srv.ListenAndServe()
if err!=nil {
fmt.Fprintf(os.Stderr, "Error starting server: %v\n" , err.Error())
}
}
##== tmp/war_and_peace.txt ========================================
Extract from Project Gutenberg EBook of War and Peace, by Leo Tolstoy
Not only the generals in full parade uniforms, with their thin or
thick waists drawn in to the utmost, their red necks squeezed into
their stiff collars, and wearing scarves and all their decorations,
not only the elegant, pomaded officers, but every soldier with his
freshly washed and shaven face and his weapons clean and polished to
the utmost, and every horse groomed till its coat shone like satin
and every hair of its wetted mane lay smooth--felt that no small
matter was happening, but an important and solemn affair. Every
general and every soldier was conscious of his own insignificance,
aware of being but a drop in that ocean of men, and yet at the same
time was conscious of his strength as a part of that enormous whole.
##== tmp/julius_caesar.txt ========================================
Extract from the Project Gutenberg EBook of The Complete Works of William Shakespeare
Friends, Romans, countrymen, lend me your ears!
I come to bury Caesar, not to praise him.
The evil that men do lives after them,
The good is oft interred with their bones;
So let it be with Caesar. The noble Brutus
Hath told you Caesar was ambitious;
If it were so, it was a grievous fault,
And grievously hath Caesar answer'd it.
Here, under leave of Brutus and the rest-
For Brutus is an honorable man;
So are they all, all honorable men-
Come I to speak in Caesar's funeral.
##== aardvark.sh ========================================
#!/bin/bash
go build tmp/staticws.go
./staticws &
sleep 1    # give the server a moment to start before curling it
curl http://localhost:8642/static/war_and_peace.txt
echo
curl http://localhost:8642/static/julius_caesar.txt
killall -u $USER staticws
Generate Go code using templates
What?
This Go application reads in the user-supplied definition (or structure) of a CSV file, and generates 'reader.go', a fully working Go application that reads that CSV file. It does this using Go's nifty template feature.
If you want to customize the resulting code, just change the template reader.tpl. For another CSV file, just change description.txt.
Prerequisite
To run this aardvark.code example you need to have the following software installed on your system:
- Go
- aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/codegen/aardvark.code
Execute
Stage 1
Read in a .txt file and turn it into a Go Descriptor object, which contains a Col object per column definition.
The structure as defined in the description.txt file
EntityName:City
0 Geonameid int
1 Name string
2 Asciiname string
4 Lat float64 # latitude
5 Lon float64 # longitude
8 Country string
14 Population int
16 Elevation float64
Filename:/u01/data/20150102_cities/cities1000.txt
Separator:\t
.. will look like this Go Descriptor object
{
EntityName:City
Filename:/u01/data/20150102_cities/cities1000.txt
Separator:\t
Numcols:8
Cols:[
{Position:0 Identifier:Geonameid Type:int ConversionFlag:true }
{Position:1 Identifier:Name Type:string ConversionFlag:false}
{Position:2 Identifier:Asciiname Type:string ConversionFlag:false}
{Position:4 Identifier:Lat Type:float64 ConversionFlag:true }
{Position:5 Identifier:Lon Type:float64 ConversionFlag:true }
{Position:8 Identifier:Country Type:string ConversionFlag:false}
{Position:14 Identifier:Population Type:int ConversionFlag:true }
{Position:16 Identifier:Elevation Type:float64 ConversionFlag:true }
]
}
Stage 2
Read in the template 'reader.tpl', apply the above data to it, and put the result in the file 'reader.go'.
I'm just going to lift the veil a bit by showing what the following snippet of the template does. For more information about Go's template feature, refer to the documentation: golang.org/pkg/text/template
When you apply the data to this template snippet ..
type «.EntityName» struct {
«range .Cols» «.Identifier» «.Type»
«end»}
.. this output will be produced:
type City struct {
Geonameid int
Name string
Asciiname string
Lat float64
Lon float64
Country string
Population int
Elevation float64
}
For your own enlightenment, look for other «...» expressions in the 'reader.tpl' file, and see the code it generates in 'reader.go'.
Stage 3
Compile and run the generated 'reader.go', which reads in the CSV file and shows a couple of records.
The aardvark.code
##== tmp/description.txt ========================================
EntityName:City
0 Geonameid int
1 Name string
2 Asciiname string
4 Lat float64 # latitude
5 Lon float64 # longitude
8 Country string
14 Population int
16 Elevation float64
Filename:/u01/data/20150102_cities/cities1000.txt
Separator:\t
##== tmp/reader.tpl ========================================
package main
import (
"os"
"strconv"
"bufio"
"fmt"
"io"
"strings"
)
func main() {
filename:="«.Filename»"
f,err := os.Open(filename)
defer f.Close()
if err != nil {
fmt.Fprintf(os.Stderr, "Opening file %q: %s\n", filename,err.Error())
os.Exit(1)
}
r:=bufio.NewReader(f)
repeat:=true
ignoredLines:=0
list:=make([]«.EntityName»,0,0)
for repeat {
line,overflow,err := r.ReadLine()
repeat = (err!=io.EOF) // EOF means stop repeating this loop
if err != nil && err!=io.EOF {
fmt.Fprintf(os.Stderr, "Read error: %s\n", err.Error())
break
}
if overflow {
fmt.Fprintf(os.Stderr, "Overflow error on reading!\n")
break
}
recs:=strings.Split(string(line),"«.Separator»")
if len(recs)>«.Numcols» {
row,err:=extract( strings.Split(string(line),"«.Separator»"))
if err!=nil {
// assume error already reported
break
}
list=append(list,row)
} else {
ignoredLines+=1
}
}
if ignoredLines>0 {
fmt.Fprintf(os.Stderr, "Warning: %v line(s) ignored because of too few fields.\n",ignoredLines)
}
for i,r:=range(list) {
fmt.Printf("%v\n",r)
if i>10 {
break
}
}
}
type «.EntityName» struct {
«range .Cols» «.Identifier» «.Type»
«end»}
func extract(rec []string) (record «.EntityName»,err error) {
«range .Cols»«if .ConversionFlag»«template "convert" .»«else» _«.Identifier» := rec[«.Position»]«end»
«end»
record = «.EntityName»{ «range .Cols»«.Identifier»:_«.Identifier», «end»}
return
}
«define "convert"»«if eq .Type "int"» _«.Identifier»:=0
if len(rec[«.Position»])>0 {
_«.Identifier»,err=strconv.Atoi(rec[«.Position»])
if err != nil {
fmt.Fprintf(os.Stderr, "Error converting «.Identifier»: %v\n", err.Error())
return
}
}«end»«if eq .Type "float64"» _«.Identifier»:=0.0
if len(rec[«.Position»])>0 {
_«.Identifier»,err=strconv.ParseFloat(rec[«.Position»],64)
if err != nil {
fmt.Fprintf(os.Stderr, "Error converting «.Identifier»: %v\n", err.Error())
return
}
}«end»«end»
##== tmp/grok.go ========================================
package main
import (
"fmt"
"bufio"
"os"
"regexp"
"io/ioutil"
"strings"
"strconv"
"text/template"
)
type Descriptor struct {
EntityName string
Filename string
Separator string
Numcols int
Cols []Col
}
type Col struct {
Position int
Identifier string
Type string
ConversionFlag bool
}
func main() {
d,err:=getDescriptor("tmp/description.txt") // read the description
if err != nil {
os.Exit(1)
}
fmt.Printf("%+v\n",d) // print the descriptor object (the stage 1 output)
f,err:=os.Create("tmp/reader.go") // prepare file for output
if err != nil {
fmt.Fprintf(os.Stderr, "File open error: %s\n", err.Error())
os.Exit(1)
}
defer f.Close()
w:=bufio.NewWriter(f)
t:=template.New("reader.tpl") // create template
t.Delims("«","»")
t=template.Must(t.ParseFiles("tmp/reader.tpl"))
err=t.Execute(w,d) // execute the template
if err!=nil {
fmt.Fprintf(os.Stderr, "Template execute error: %s\n", err.Error())
}
w.Flush()
}
func getDescriptor(filename string) (desc Descriptor, err error) {
desc=Descriptor{ EntityName:"x" }
content, err := ioutil.ReadFile(filename)
if err != nil {
fmt.Fprintf(os.Stderr, "File read error: %s\n", err.Error())
}
body:=strings.TrimSpace(strings.Replace( string(content), "\n","|",-1) )
// regular expressions matching 1) key:value pair 2) fields line
reKeyValue := regexp.MustCompile(`^\s*(\w+)\s*:\s*(\S+).*`)
reFields := regexp.MustCompile(`^\s*(\d+)\s*(\w+)\s*(\w+).*`)
desc.Cols = make([]Col, 0, 0)
for _,line:= range strings.Split(body,"|") {
if n:=strings.Index(line,"#"); n>-1 { // remove comments
line=line[:n]
}
line=strings.TrimSpace(line) // empty string?
if len(line)<=1 {
continue
}
group:=reKeyValue.FindStringSubmatch(line) // pattern: key:value
if group!=nil {
digestKeyValue(&desc, group)
continue
}
group=reFields.FindStringSubmatch(line) // pattern: num word word
if group!=nil {
err=digestFields(&desc, group)
if err!=nil {
break
}
}
}
desc.Numcols=len(desc.Cols)
return
}
func digestKeyValue(desc *Descriptor, group []string) {
k:= group[1]
v:= group[2]
if (k=="EntityName") {
desc.EntityName=v
} else if (k=="Filename") {
desc.Filename=v
} else if (k=="Separator") {
desc.Separator=v
} else {
fmt.Fprintf(os.Stderr, "WARNING: Key:Value pair %v:%v ignored\n", k,v)
}
}
func digestFields(desc *Descriptor, group []string) (err error) {
p,err:=strconv.Atoi(group[1])
if err != nil {
fmt.Fprintf(os.Stderr, "Conversion error: %s\n", err.Error())
return
}
id:=group[2]
desc.Cols=append(desc.Cols, Col{ Position:p,
Identifier:id,
Type: group[3],
ConversionFlag: group[3]!="string" })
return
}
##== aardvark.sh ========================================
#!/bin/bash
rm -f grok tmp/reader.go reader # cleanup
go build tmp/grok.go # build the code-generator
if [ -x ./grok ]
then
./grok
fi
if [ -f ./tmp/reader.go ]
then
go build tmp/reader.go # build the generated go code
./reader # and execute
fi
Aardvark in append mode
What?
Normally, when a filename occurs more than once in an aardvark.code file, the file gets overwritten. But if you precede the filename with a plus sign, aardvark will append to the existing file instead of overwriting it.
Prerequisite
To run this aardvark.code example you need to have the following software installed on your system:
- Go
- aardvark
Grab this aardvark.code file:
wget http://data.munging.ninja/aardvarkcode/append/aardvark.code
Look at the file
Spot the '+' signs in front of the filenames (e.g. +spanish.txt).
##== spanish.txt =========================================================
El lago es el segundo lago más grande de América del Sur,
localizado en la Patagonia y compartido por Chile y Argentina.
##== english.txt =========================================================
The lake is the second biggest lake of South America, located
in Pataganio and shared by Chile and Argentina.
##== nederlands.txt ======================================================
Het meer is het tweede grootste meer van Zuid Amerika, gelocaliseerd
in Patagonie and gedeeld door Chili en Argentinie.
##== +spanish.txt ========================================================
A cada lado de la frontera tiene nombres diferentes: en Chile es
conocido como lago General Carrera, mientras que
en Argentina se le denomina lago Buenos Aires.
##== +english.txt ========================================================
At both sides of the border it has a different name: in Chili it is
known as Lake General Carrera, while in Argentina it is named
Lake Buenos Aires.
##== +nederlands.txt =====================================================
Aan beide zijden van de grense heeft het een verschillende naam: in Chile
het is bekend als meer General Carrera, terwijl in Argentinie het
meer Buenos Aires benoemd werd.
Run aardvark
As you might have guessed, when you run aardvark on this file, you get three output files, one per language.
Output: spanish.txt
El lago es el segundo lago más grande de América del Sur,
localizado en la Patagonia y compartido por Chile y Argentina.
A cada lado de la frontera tiene nombres diferentes: en Chile es
conocido como lago General Carrera, mientras que
en Argentina se le denomina lago Buenos Aires.
Output: english.txt
The lake is the second biggest lake of South America, located
in Pataganio and shared by Chile and Argentina.
At both sides of the border it has a different name: in Chili it is
known as Lake General Carrera, while in Argentina it is named
Lake Buenos Aires.
Output: nederlands.txt
Het meer is het tweede grootste meer van Zuid Amerika, gelocaliseerd
in Patagonie and gedeeld door Chili en Argentinie.
Aan beide zijden van de grense heeft het een verschillende naam: in Chile
het is bekend als meer General Carrera, terwijl in Argentinie het
meer Buenos Aires benoemd werd.