Data Munging Ninja

Relational Data

Useful RDBMS data.



Python Book

My collection of Python shorties and quickies.



Convex Hull

A Python implementation of an algorithm that decides which points of a collection make up the circumference, or hull.
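To give a flavour of what such an algorithm looks like, here is a minimal sketch using Andrew's monotone chain. This is an assumption for illustration: the article itself may use a different method, and the names `cross` and `convex_hull` are made up here.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB.

    Positive when o -> a -> b turns counter-clockwise.
    """
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order.

    Sketch only: uses Andrew's monotone chain, which may not be the
    algorithm the article implements. Collinear points are dropped.
    """
    pts = sorted(set(points))  # sort by x, then y, and remove duplicates
    if len(pts) <= 2:
        return pts

    # Build the lower hull, walking left to right.
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    # Build the upper hull, walking right to left.
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]
```

For a square with one point in the middle, `convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])` returns the four corners and leaves out `(1, 1)`.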



Simple Sales Prediction

Dig deeper into Spark dataframes to predict next month's sales units via simple linear regression. Get comfortable with dataframe functions for aggregation, defining new columns, dropping and renaming columns, joining dataframes on multiple keys, and converting a dataframe from wide to narrow format. Also learn how to create a simple User Defined Function (UDF).




Aardvark is about putting a bunch of code files together in one file, and executing one command to do all that's necessary to produce the desired output.
Stop managing files and concentrate on writing code!



All Cylinders

How do I package a Scala program for submission to my Spark cluster? And how do I check that the cluster is firing on all cylinders? (Note: uses NYC Taxi data.)



Framed Data

How to do the same thing with data using different technologies: SQL, R, Python, Spark. Simple operations like aggregation, frequency, etc.



Simple Linear Regression: overview

The title says it all: an overview of simple linear regression. Homebrew implementations of the formulas, alongside library versions. Mostly in Python, but also some R.
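As a taste of the homebrew approach, here is a minimal sketch of the textbook least-squares formulas in plain Python: slope = S_xy / S_xx and intercept = mean(y) - slope * mean(x). The function name `fit_line` is made up for this illustration and is not taken from the article.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Sketch only: `fit_line` is a hypothetical name, not the article's code.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")

    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # S_xy: sum of products of deviations; S_xx: sum of squared x deviations.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; slope is undefined")

    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept
```

With points that lie exactly on y = 1 + 2x, such as `fit_line([1, 2, 3, 4], [3, 5, 7, 9])`, the fit recovers a slope of 2 and an intercept of 1.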



Terentiaflores: from OSM PBF to Hive query

Cicero has upset his wife, and to make amends he is looking for the flower shop nearest to his house on the Palatine hill (41.8898803,12.4849976).
Data: the OpenStreetMap PBF file of Europe (17 gigabytes).
Query tool of choice: Hive.
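
Finding the nearest shop comes down to a distance between two latitude/longitude pairs. The article does this in Hive; the sketch below only shows the underlying idea in Python, using the haversine formula on a spherical Earth with a mean radius of 6371 km. Both the formula choice and the names `haversine_km` and `nearest` are assumptions made for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees.

    Sketch only: the article's Hive query may compute distance differently.
    """
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest(home, shops):
    """Return the (lat, lon) from `shops` that is closest to `home`."""
    return min(shops, key=lambda s: haversine_km(home[0], home[1], s[0], s[1]))
```

Calling `nearest((41.8898803, 12.4849976), shops)` with a list of `(lat, lon)` tuples extracted from the OSM data would return the closest candidate.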