Framed Data
 
04_frequency
20160601

Frequency

Count the occurrence of the city names, and list the top 20. Additional condition: population has to be greater than 100.

Data entity used: city (see section 2b data load ).

Query:

ct.filter( r => r.population>100).
   map( r => (r.asciiname, 1) ).
   reduceByKey( (_ + _)).
   sortBy(-_._2).
   take(20).
   foreach(println)

Result:

(San Antonio,31)
(San Miguel,31)
(San Francisco,28)
(San Jose,26)
(San Isidro,25)
(Santa Cruz,25)
(Buenavista,24)
(Clinton,24)
(Newport,24)
(San Vicente,23)
(Victoria,23)
(Santa Maria,23)
(Richmond,22)
(San Carlos,21)
(Santa Ana,21)
(Georgetown,21)
(San Pedro,20)
(Springfield,20)
(Franklin,20)
(Salem,19)

Another way is to use the RDD.countByValue function which for RDD[T] returns Map[T,Long]. BUT this turns our RDD into a scala.collection.Map[String,Long], ie. it's now a 'local' collection, and no longer distributed.

ct.filter( r => r.population>100).
   map( r => r.asciiname ).
   countByValue().
   toList.
   sortBy(-_._2).
   take(20).foreach(println)

Query:

city_df.filter(" population>100 ").
        groupBy("asciiname").count().
        orderBy(desc("count")).
        take(20).
        foreach(println)

Result:

[San Miguel,31]                                                                 
[San Antonio,31]
[San Francisco,28]
[San Jose,26]
[San Isidro,25]
[Santa Cruz,25]
[Clinton,24]
[Buenavista,24]
[Newport,24]
[Victoria,23]
[Santa Maria,23]
[San Vicente,23]
[Richmond,22]
[Santa Ana,21]
[San Carlos,21]
[Georgetown,21]
[Springfield,20]
[Franklin,20]
[San Pedro,20]
[Greenville,19]

Query:

sqlctx.sql("""
select asciiname, count(1) as cnt
from city   
where population>100
group by asciiname 
order by cnt desc
limit 20
""").collect().foreach(println)

Result:

[San Antonio,31]                                                                
[San Miguel,31]
[San Francisco,28]
[San Jose,26]
[San Isidro,25]
[Santa Cruz,25]
[Clinton,24]
[Buenavista,24]
[Newport,24]
[Santa Maria,23]
[Victoria,23]
[San Vicente,23]
[Richmond,22]
[San Carlos,21]
[Georgetown,21]
[Santa Ana,21]
[Franklin,20]
[Springfield,20]
[San Pedro,20]
[La Union,19]

Making use of the count function of plyr:

library(plyr)

ff=count(df[df$population>100,], "asciiname")
ff[order(-ff$freq)[1:20],]

          asciiname freq
75375   San Antonio   31
76514    San Miguel   31
75784 San Francisco   28
76042      San Jose   26
76000    San Isidro   25
77019    Santa Cruz   25
12900    Buenavista   24
19349       Clinton   24
59090       Newport   24
77184   Santa Maria   23
77720   San Vicente   23
92062      Victoria   23
71507      Richmond   22
30887    Georgetown   21
75521    San Carlos   21
76926     Santa Ana   21
29256      Franklin   20
76679     San Pedro   20
82832   Springfield   20
32755    Greenville   19
dt[population>100][,.N,by=asciiname][order(-N)][1:20]

        asciiname  N
 1:    San Miguel 31
 2:   San Antonio 31
 3: San Francisco 28
 4:      San Jose 26
 5:    San Isidro 25
 6:    Santa Cruz 25
 7:       Newport 24
 8:    Buenavista 24
 9:       Clinton 24
10:   San Vicente 23
11:   Santa Maria 23
12:      Victoria 23
13:      Richmond 22
14:     Santa Ana 21
15:    San Carlos 21
16:    Georgetown 21
17:     San Pedro 20
18:      Franklin 20
19:   Springfield 20
20:       Windsor 19

Applying value_counts() to a dataframe column returns a series.

df[df.population>100]["asciiname"].value_counts()[:20]

San Antonio      31
San Miguel       31
San Francisco    28
San Jose         26
Santa Cruz       25
San Isidro       25
Newport          24
Clinton          24
Buenavista       24
Santa Maria      23
San Vicente      23
Victoria         23
Richmond         22
San Carlos       21
Santa Ana        21
Georgetown       21
San Pedro        20
Franklin         20
Springfield      20
Greenville       19

Another way is to use the collections.Counter:

import collections
ctr=collections.Counter( df[df.population>100]["asciiname"])
ctr.most_common(20) 

[('San Miguel', 31),
 ('San Antonio', 31),
 ('San Francisco', 28),
 ('San Jose', 26),
 ('Santa Cruz', 25),
 ('San Isidro', 25),
 ('Buenavista', 24),
 ('Newport', 24),
 ('Clinton', 24),
 ('Santa Maria', 23),
 ('San Vicente', 23),
 ('Victoria', 23),
 ('Richmond', 22),
 ('San Carlos', 21),
 ('Santa Ana', 21),
 ('Georgetown', 21),
 ('San Pedro', 20),
 ('Franklin', 20),
 ('Springfield', 20),
 ('Salem', 19)]

postgres

select asciiname,count(1) 
from t_city 
where population>100
group by asciiname 
order by 2 desc 
limit 20

   asciiname   | count 
---------------+-------
 San Antonio   |    31
 San Miguel    |    31
 San Francisco |    28
 San Jose      |    26
 San Isidro    |    25
 Santa Cruz    |    25
 Buenavista    |    24
 Clinton       |    24
 Newport       |    24
 San Vicente   |    23
 Santa Maria   |    23
 Victoria      |    23
 Richmond      |    22
 Santa Ana     |    21
 Georgetown    |    21
 San Carlos    |    21
 Franklin      |    20
 Springfield   |    20
 San Pedro     |    20
 Greenville    |    19
(20 rows)
 
Notes by Data Munging Ninja. Generated on nini:sync/20151223_datamungingninja/frameddata at 2016-10-18 07:18