All Cylinders
Data: NYC taxi rides of 2015
The Scala job in question is going to parse the New York City taxi data of 2015 and tally up the following:
how many rides
how many miles
how many passengers
Go ahead and download the yellow cab trip sheet data from www.nyc.gov/html/tlc/html/about/trip_record_data.shtml and put it on your HDFS.
Before you blow up your data pipeline, a little warning about size: every file is between 1.7 GB and 2.0 GB, which brings the total to about 22 GB.
For a detailed description of the data fields, see: www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
A quick cut from the July file:
VID  pickup_datetime      dropoff_datetime     #psngr  dist   pickup_longitude     pickup_latitude ..
1    2015-07-01 00:00:00  2015-07-01 00:15:26  1       3.50   -73.994155883789063  40.751125335693359
1    2015-07-01 00:00:00  2015-07-01 00:22:22  1       3.90   -73.984657287597656  40.768486022949219
1    2015-07-01 00:00:00  2015-07-01 00:07:42  1       2.30   -73.978889465332031  40.762287139892578
1    2015-07-01 00:00:00  2015-07-01 00:39:37  1       9.20   -73.992790222167969  40.742759704589844
1    2015-07-01 00:00:00  2015-07-01 00:05:34  1       1.10   -73.912429809570313  40.769809722900391
1    2015-07-01 00:00:00  2015-07-01 00:06:46  2       1.00   -73.959159851074219  40.773429870605469
2    2015-07-01 00:00:00  2015-07-01 00:36:57  2       19.12  -73.789459228515625  40.647258758544922
2    2015-07-01 00:00:00  2015-07-01 06:30:15  1       .00    0                    0
2    2015-07-01 00:00:00  2015-07-01 11:27:07  1       2.58   -73.998931884765625  40.744678497314453
2    2015-07-01 00:00:00  2015-07-01 00:00:00  1       1.07   -73.99383544921875   40.735431671142578
We are interested in fields:
passenger_count (#psngr)
trip_distance (dist)
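
To make the tally concrete, here is a minimal sketch of how one line could be turned into a (rides, miles, passengers) triple and summed up. It is only an illustration under a few assumptions: the raw files are comma-separated, passenger_count sits at index 3 and trip_distance at index 4 (as the excerpt above suggests), and the names TaxiTally, Tally, parseLine and tally are made up for this example.

```scala
import scala.util.Try

// Hypothetical helper, not part of any library.
object TaxiTally {
  // Running totals for the three figures we want.
  final case class Tally(rides: Long, miles: Double, passengers: Long) {
    def +(other: Tally): Tally =
      Tally(rides + other.rides, miles + other.miles, passengers + other.passengers)
  }

  val Empty: Tally = Tally(0L, 0.0, 0L)

  // Assumption: comma-separated input with passenger_count at index 3
  // and trip_distance at index 4.
  def parseLine(line: String): Option[Tally] = {
    val fields = line.split(",", -1)
    if (fields.length < 5) None
    else
      // The header line and malformed rows fail number parsing and are skipped.
      Try(Tally(1L, fields(4).trim.toDouble, fields(3).trim.toLong)).toOption
  }

  // Every parsable line counts as one ride; unparsable lines are ignored.
  def tally(lines: Iterator[String]): Tally =
    lines.flatMap(parseLine).foldLeft(Empty)(_ + _)
}
```

Because adding two Tally values is associative, partial results computed per file or per partition can be combined in any order, which is what makes this shape convenient for a distributed job.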