Framed Data
 
04_frequency
20160601

Frequency

Count how often each city name occurs, and list the 20 most frequent names. Additional condition: only cities with a population greater than 100 are counted.

Data entity used: city (see section 2b data load).

Query:

ct.filter( r => r.population>100).
   map( r => (r.asciiname, 1) ).
   reduceByKey(_ + _).
   sortBy(-_._2).
   take(20).
   foreach(println)

Result:

(San Antonio,31)
(San Miguel,31)
(San Francisco,28)
(San Jose,26)
(San Isidro,25)
(Santa Cruz,25)
(Buenavista,24)
(Clinton,24)
(Newport,24)
(San Vicente,23)
(Victoria,23)
(Santa Maria,23)
(Richmond,22)
(San Carlos,21)
(Santa Ana,21)
(Georgetown,21)
(San Pedro,20)
(Springfield,20)
(Franklin,20)
(Salem,19)

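Another way is to use the RDD.countByValue function, which for an RDD[T] returns a Map[T,Long]. Be aware that countByValue is an action: it turns our RDD into a scala.collection.Map[String,Long], i.e. it is now a 'local' collection held on the driver and no longer distributed. The sorting and take(20) that follow therefore run as plain Scala collection operations, so this approach only suits cases where the number of distinct names fits in driver memory.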

ct.filter( r => r.population>100).
   map( r => r.asciiname ).
   countByValue().
   toList.
   sortBy(-_._2).
   take(20).foreach(println)
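
To try the counting logic without a cluster, here is a minimal sketch on plain Scala collections. The City case class, the sample rows and the topNames helper are hypothetical stand-ins for illustration only; they are not the city entity or the ct RDD from the data load. Ties are broken by name here so that the output order is deterministic, which the queries above do not guarantee.

```scala
// Hypothetical stand-in for the city records; only the two fields used above.
case class City(asciiname: String, population: Long)

object FrequencySketch {
  // Made-up sample rows (assumption), not taken from the real data set.
  val sample: List[City] = List(
    City("Springfield", 150000), City("Springfield", 60000), City("Springfield", 50),
    City("Salem", 170000), City("Salem", 40000), City("Salem", 2000),
    City("Clinton", 26000)
  )

  // Same steps as the queries: filter on population, count per name,
  // sort by descending count (then by name for ties), keep the first n.
  def topNames(cities: Seq[City], n: Int): List[(String, Int)] =
    cities.filter(_.population > 100)
      .groupBy(_.asciiname)
      .map { case (name, rows) => (name, rows.size) }
      .toList
      .sortBy { case (name, count) => (-count, name) }
      .take(n)
}
```

Calling FrequencySketch.topNames(FrequencySketch.sample, 2) gives List((Salem,3), (Springfield,2)): the Springfield row with population 50 is dropped by the filter before counting.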
 
Notes by Data Munging Ninja. Generated on nini:sync/20151223_datamungingninja/frameddata at 2016-10-18 07:18