Data Munging Ninja

20200531

Python Book

My collection of Python shorties and quickies.

20161206

Convex Hull

A Python implementation of an algorithm that decides which points of a collection make up its outer boundary, or hull.
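As a rough illustration of the idea, here is a minimal sketch of one common hull algorithm, Andrew's monotone chain. This is an assumption: it is not necessarily the algorithm used in the post, and the names `cross` and `convex_hull` are made up for this example.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB.

    Positive means O -> A -> B is a counter-clockwise turn.
    """
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order.

    Collinear points on the boundary are dropped.
    """
    pts = sorted(set(points))  # sort by x, then y; remove duplicates
    if len(pts) <= 2:
        return pts

    # Build the lower hull from left to right.
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    # Build the upper hull from right to left.
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


if __name__ == "__main__":
    sample = [(0, 0), (1, 1), (2, 2), (2, 0), (0, 2), (1, 0.5)]
    print(convex_hull(sample))
```

Sorting dominates the cost, so the sketch runs in O(n log n); the interior points (1, 1) and (1, 0.5) are discarded.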

20160616

Dig deeper into Spark dataframes to predict the next month's sales units via simple linear regression. Get comfortable with dataframe functions for aggregation, defining new columns, dropping and renaming columns, joining dataframes on multiple keys, and converting a dataframe from wide to narrow format. Also learn how to create a simple User Defined Function (UDF).

20160515

aardvark.code

Aardvark is about putting a bunch of code files together in one file and executing one command to do everything necessary to produce the desired output.
Stop managing files and concentrate on writing code!

20160428

All Cylinders

How do I package a Scala program for submission to my Spark cluster? And how do I check that the cluster is firing on all cylinders? (Note: uses NYC Taxi data.)

20160331

Framed Data

How to do the same thing with data using different technologies: SQL, R, Python, Spark. Simple operations like aggregation, frequency, etc.

20160120

Simple Linear Regression: overview

The title says it all: an overview of simple linear regression. Homebrew implementations of the formulas, as well as versions using libraries. Mostly in Python, but also some R.
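For a flavour of the homebrew approach, here is a minimal sketch of the textbook least-squares formulas for slope and intercept in plain Python. The function name `fit_line` and the sample data are assumptions made for this example, not taken from the post.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    intercept = mean_y - slope * mean_x
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Made-up points lying exactly on y = 2x + 1.
    slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(slope, intercept)
```

On Python 3.10 and later, the standard library's `statistics.linear_regression` should give the same slope and intercept, which makes a handy cross-check.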

20151226

Terentiaflores: from OSM PBF to Hive query

Cicero has upset his wife, and to make amends he is looking for the flower shop nearest to his house on the Palatine Hill (41.8898803,12.4849976).
Data: the OpenStreetMap PBF file of Europe (17 gigabytes).
Query tool of choice: Hive.
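The post itself does the search in Hive. Purely to illustrate the "nearest shop" step, here is a small Python sketch that uses the haversine great-circle distance. The distance formula is an assumption (the post may compute distance differently), and the shop names and coordinates are hypothetical placeholders, not OpenStreetMap data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest(home, shops):
    """Return the (name, lat, lon) tuple closest to home, a (lat, lon) pair."""
    return min(shops, key=lambda s: haversine_km(home[0], home[1], s[1], s[2]))


if __name__ == "__main__":
    home = (41.8898803, 12.4849976)  # coordinates quoted in the post
    # Hypothetical shops, invented for this example.
    shops = [
        ("shop_a", 41.8900, 12.4850),
        ("shop_b", 41.9000, 12.5000),
    ]
    print(nearest(home, shops))
```

A linear scan like this is fine for a handful of candidates; on a Europe-sized extract the filtering and distance calculation would be pushed into the query engine instead.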

Relational Data