All Cylinders
 
02_data
20160428

Data: NYC taxi rides of 2015

The Scala job in question parses the New York City taxi data of 2015 and tallies up the following:

  • how many rides
  • how many miles
  • how many passengers
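
To make the three tallies concrete, here is a minimal sketch on plain Scala collections. It is not the job itself: the names Ride, Tally and TaxiTally are made up for illustration, and it assumes each ride has already been reduced to a passenger count and a distance in miles.

```scala
// Hypothetical record type: one ride, already parsed (illustrative names).
final case class Ride(passengers: Long, miles: Double)

// Running totals for the three quantities in the list above.
final case class Tally(rides: Long, miles: Double, passengers: Long) {
  // Fold one ride into the totals.
  def add(r: Ride): Tally =
    Tally(rides + 1, miles + r.miles, passengers + r.passengers)

  // Combine two partial tallies, e.g. one per monthly file.
  def merge(o: Tally): Tally =
    Tally(rides + o.rides, miles + o.miles, passengers + o.passengers)
}

object TaxiTally {
  val empty: Tally = Tally(0L, 0.0, 0L)

  // Tally a collection of rides in a single pass.
  def tally(rides: Iterable[Ride]): Tally =
    rides.foldLeft(empty)((t, r) => t.add(r))
}
```

The merge method is there because the data comes as one file per month: each file can be tallied on its own and the partial results combined afterwards.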

Go ahead and download the yellow cab trip sheet data from www.nyc.gov/html/tlc/html/about/trip_record_data.shtml and put it on your HDFS.

Before you blow up your data pipeline, a little warning about size: every file is between 1.7 GB and 2.0 GB, which brings the total to about 22 GB.

For a detailed description of the data fields, see: www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

A quick cut from the July file:

VID      pickup_datetime      dropoff_datetime #psngr dist  pickup_longitude      pickup_latitude    ..
1    2015-07-01 00:00:00   2015-07-01 00:15:26   1    3.50  -73.994155883789063   40.751125335693359
1    2015-07-01 00:00:00   2015-07-01 00:22:22   1    3.90  -73.984657287597656   40.768486022949219
1    2015-07-01 00:00:00   2015-07-01 00:07:42   1    2.30  -73.978889465332031   40.762287139892578
1    2015-07-01 00:00:00   2015-07-01 00:39:37   1    9.20  -73.992790222167969   40.742759704589844
1    2015-07-01 00:00:00   2015-07-01 00:05:34   1    1.10  -73.912429809570313   40.769809722900391
1    2015-07-01 00:00:00   2015-07-01 00:06:46   2    1.00  -73.959159851074219   40.773429870605469
2    2015-07-01 00:00:00   2015-07-01 00:36:57   2   19.12  -73.789459228515625   40.647258758544922
2    2015-07-01 00:00:00   2015-07-01 06:30:15   1     .00    0                    0
2    2015-07-01 00:00:00   2015-07-01 11:27:07   1    2.58  -73.998931884765625   40.744678497314453
2    2015-07-01 00:00:00   2015-07-01 00:00:00   1    1.07  -73.99383544921875    40.735431671142578

We are interested in fields:

  • passenger_count (#psngr)
  • trip_distance (dist)
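
Pulling these two fields out of a raw line could look like the sketch below. It rests on assumptions: that the downloaded files are comma-separated with a header line, and that the columns come in the same order as the sample above, which puts passenger_count at position 3 and trip_distance at position 4 (counting from 0). The object name TaxiFields and its members are made up for illustration.

```scala
import scala.util.Try

object TaxiFields {
  // Assumed 0-based column positions, taken from the column order of the sample.
  val PassengerCountIdx = 3
  val TripDistanceIdx = 4

  // Returns Some((passenger_count, trip_distance)) for a data line, and
  // None for a header, a blank line, or anything that does not parse.
  def parse(line: String): Option[(Int, Double)] = {
    val cols = line.split(",", -1).map(_.trim)
    if (cols.length <= TripDistanceIdx) None
    else Try((cols(PassengerCountIdx).toInt, cols(TripDistanceIdx).toDouble)).toOption
  }
}
```

Returning an Option keeps the header line and any malformed rows out of the totals without the job falling over on them.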
 
Notes by Data Munging Ninja. Generated on nini:sync/20151223_datamungingninja/allcylinders at 2016-10-18 07:19