All Cylinders 
     
    
      
         
      
Data: NYC taxi rides of 2015 
The Scala job in question parses the 2015 New York City taxi data and tallies up the following:
how many rides 
how many miles 
how many passengers 
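The three tallies above fit in one small accumulator. What follows is a minimal sketch in plain Scala, not the job itself: the `Tally` case class, its method names, and the `TallyDemo` object are assumptions made for illustration, and the sample values are taken from three rows of the excerpt further down.

```scala
// Hypothetical accumulator for the three totals; all names are illustrative.
final case class Tally(rides: Long, miles: Double, passengers: Long) {
  // Fold one ride into the running totals.
  def add(passengerCount: Int, tripDistance: Double): Tally =
    Tally(rides + 1, miles + tripDistance, passengers + passengerCount)

  // Combine two partial tallies, e.g. totals computed on separate files.
  def merge(other: Tally): Tally =
    Tally(rides + other.rides, miles + other.miles, passengers + other.passengers)
}

object Tally {
  val empty: Tally = Tally(0L, 0.0, 0L)
}

object TallyDemo {
  def main(args: Array[String]): Unit = {
    // (passenger_count, trip_distance) pairs from the sample rows
    val sample: Seq[(Int, Double)] = Seq((1, 3.50), (1, 3.90), (2, 1.00))

    val total = sample.foldLeft(Tally.empty) {
      case (acc, (passengers, miles)) => acc.add(passengers, miles)
    }
    println(s"rides=${total.rides} miles=${total.miles} passengers=${total.passengers}")
  }
}
```

Keeping `merge` separate from `add` means partial results can be computed independently per file or per chunk and combined afterwards, which is one way to spread the work over several cores.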
 
Go ahead and download the yellow cab trip sheet data from www.nyc.gov/html/tlc/html/about/trip_record_data.shtml  and put it on your HDFS.
Before you blow up your data pipeline, a little warning about size: every file is between 1.7 GB and 2.0 GB, which brings the total to about 22 GB.
For a detailed description of the data fields, see: www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf 
A quick cut from the July file:
VID      pickup_datetime      dropoff_datetime #psngr dist  pickup_longitude      pickup_latitude    ..
1    2015-07-01 00:00:00   2015-07-01 00:15:26   1    3.50  -73.994155883789063   40.751125335693359
1    2015-07-01 00:00:00   2015-07-01 00:22:22   1    3.90  -73.984657287597656   40.768486022949219
1    2015-07-01 00:00:00   2015-07-01 00:07:42   1    2.30  -73.978889465332031   40.762287139892578
1    2015-07-01 00:00:00   2015-07-01 00:39:37   1    9.20  -73.992790222167969   40.742759704589844
1    2015-07-01 00:00:00   2015-07-01 00:05:34   1    1.10  -73.912429809570313   40.769809722900391
1    2015-07-01 00:00:00   2015-07-01 00:06:46   2    1.00  -73.959159851074219   40.773429870605469
2    2015-07-01 00:00:00   2015-07-01 00:36:57   2   19.12  -73.789459228515625   40.647258758544922
2    2015-07-01 00:00:00   2015-07-01 06:30:15   1     .00    0                    0
2    2015-07-01 00:00:00   2015-07-01 11:27:07   1    2.58  -73.998931884765625   40.744678497314453
2    2015-07-01 00:00:00   2015-07-01 00:00:00   1    1.07  -73.99383544921875    40.735431671142578
We are interested in fields:
passenger_count (#psngr) 
trip_distance (dist)
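
Pulling these two fields out of a record could look like the sketch below. It assumes the downloaded files are comma-separated, with the columns in the same order as the excerpt above, so that passenger_count sits at zero-based column 3 and trip_distance at column 4. `RideParser` and `parseRide` are hypothetical names, and the header line in the example is made up from the excerpt's labels.

```scala
import scala.util.Try

object RideParser {
  // Assumed zero-based column positions, following the excerpt's column order.
  val PassengerCountIdx = 3
  val TripDistanceIdx = 4

  // Returns Some((passenger_count, trip_distance)), or None for a header
  // line and for rows that are too short or not numeric.
  def parseRide(line: String): Option[(Int, Double)] = {
    val cols = line.split(",", -1).map(_.trim)
    if (cols.length <= TripDistanceIdx) None
    else Try((cols(PassengerCountIdx).toInt, cols(TripDistanceIdx).toDouble)).toOption
  }

  def main(args: Array[String]): Unit = {
    val row = "2,2015-07-01 00:00:00,2015-07-01 00:36:57,2,19.12,-73.789459228515625,40.647258758544922"
    println(parseRide(row)) // Some((2,19.12))

    // A header-like line fails the numeric conversion and is skipped.
    println(parseRide("VID,pickup_datetime,dropoff_datetime,passenger_count,trip_distance")) // None
  }
}
```

Returning an `Option` keeps the decision about bad rows with the caller: they can be dropped silently with `flatMap`, or counted separately if data quality matters.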