Data Munging Ninja
20200531 

Relational Data

Useful RDBMS data.

 

20190106 

Python Book

My collection of Python shorties and quickies.

 

20161206 

Convex Hull

A Python implementation of an algorithm that decides which points of a collection make up the circumference, or hull.
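As a rough illustration of the idea, here is a minimal sketch of one well-known hull algorithm, Andrew's monotone chain. This is an assumption for illustration only: the post itself may use a different algorithm, and the function names `cross` and `convex_hull` are made up here.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b.

    Positive means a counter-clockwise turn, negative a clockwise turn,
    zero means the three points are collinear.
    """
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))  # sort by x, then y, and drop duplicates
    if len(pts) <= 2:
        return pts

    # Build the lower hull, left to right.
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()  # last point makes a non-left turn: not on the hull
        lower.append(p)

    # Build the upper hull, right to left.
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


if __name__ == "__main__":
    square_with_centre = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
    print(convex_hull(square_with_centre))
```

The interior point (1, 1) is dropped, and only the four corners remain on the hull.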

 

20160616 

Simple Sales Prediction

Dig deeper into Spark dataframes to predict next month's sales units via simple linear regression. Get comfortable with dataframe functions for aggregation, defining new columns, dropping and renaming columns, joining dataframes on multiple keys, and converting a dataframe from wide to narrow format. Also learn how to create a simple User Defined Function (UDF).

 

20160515 

aardvark.code

Aardvark is about putting a bunch of code files together in one file, and executing one command to do all that's necessary to produce the desired output.
Stop managing files and concentrate on writing code!

 

20160428 

All Cylinders

How do I package a Scala program for submission to my Spark cluster? And how do I check that the cluster is firing on all cylinders? (Note: uses NYC Taxi data.)

 

20160331 

Framed Data

How to do the same thing with data using different technologies: SQL, R, Python, Spark. Simple operations like aggregation, frequency, etc.

 

20160120 

Simple Linear Regression: overview

The title says it all: simple linear regression overview. Homebrew implementation of the formulas, as well as using libraries. Mostly in Python but also some R.
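To give a flavour of what a homebrew implementation can look like, here is a minimal sketch of the standard least-squares formulas in plain Python: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The function name `fit_line` and the sample numbers are assumptions made for this illustration, not taken from the post.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Made-up points lying exactly on y = 2x + 1.
    slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(slope, intercept)
```

For real data a library routine is usually the better choice; the hand-rolled version mainly makes the formulas visible.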

 

20151226 

Terentiaflores: from OSM PBF to Hive query

Cicero has upset his wife, and to make amends he is looking for the flower shop nearest to his house on the Palatine Hill (41.8898803,12.4849976).
Data: the OpenStreetMap PBF file of Europe (17 gigabytes).
Query tool of choice: Hive.
@dtmngngnnj