|
Frequency
Count the occurrence of the city names, and list the top 20. Additional condition: population has to be greater than 100.
Data entity used: city (see section 2b data load ).
Query:
ct.filter( r => r.population>100).
map( r => (r.asciiname, 1) ).
reduceByKey( (_ + _)).
sortBy(-_._2).
take(20).
foreach(println)
Result:
(San Antonio,31)
(San Miguel,31)
(San Francisco,28)
(San Jose,26)
(San Isidro,25)
(Santa Cruz,25)
(Buenavista,24)
(Clinton,24)
(Newport,24)
(San Vicente,23)
(Victoria,23)
(Santa Maria,23)
(Richmond,22)
(San Carlos,21)
(Santa Ana,21)
(Georgetown,21)
(San Pedro,20)
(Springfield,20)
(Franklin,20)
(Salem,19)
Another way is to use the RDD.countByValue function which for RDD[T] returns Map[T,Long]. BUT this turns our RDD into a scala.collection.Map[String,Long] , ie. it's now a 'local' collection, and no longer distributed.
ct.filter( r => r.population>100).
map( r => r.asciiname ).
countByValue().
toList.
sortBy(-_._2).
take(20).foreach(println)
Query:
city_df.filter(" population>100 ").
groupBy("asciiname").count().
orderBy(desc("count")).
take(20).
foreach(println)
Result:
[San Miguel,31]
[San Antonio,31]
[San Francisco,28]
[San Jose,26]
[San Isidro,25]
[Santa Cruz,25]
[Clinton,24]
[Buenavista,24]
[Newport,24]
[Victoria,23]
[Santa Maria,23]
[San Vicente,23]
[Richmond,22]
[Santa Ana,21]
[San Carlos,21]
[Georgetown,21]
[Springfield,20]
[Franklin,20]
[San Pedro,20]
[Greenville,19]
Query:
sqlctx.sql("""
select asciiname, count(1) as cnt
from city
where population>100
group by asciiname
order by cnt desc
limit 20
""").collect().foreach(println)
Result:
[San Antonio,31]
[San Miguel,31]
[San Francisco,28]
[San Jose,26]
[San Isidro,25]
[Santa Cruz,25]
[Clinton,24]
[Buenavista,24]
[Newport,24]
[Santa Maria,23]
[Victoria,23]
[San Vicente,23]
[Richmond,22]
[San Carlos,21]
[Georgetown,21]
[Santa Ana,21]
[Franklin,20]
[Springfield,20]
[San Pedro,20]
[La Union,19]
Making use of the count function of plyr :
library(plyr)
ff=count(df[df$population>100,], "asciiname")
ff[order(-ff$freq)[1:20],]
asciiname freq
75375 San Antonio 31
76514 San Miguel 31
75784 San Francisco 28
76042 San Jose 26
76000 San Isidro 25
77019 Santa Cruz 25
12900 Buenavista 24
19349 Clinton 24
59090 Newport 24
77184 Santa Maria 23
77720 San Vicente 23
92062 Victoria 23
71507 Richmond 22
30887 Georgetown 21
75521 San Carlos 21
76926 Santa Ana 21
29256 Franklin 20
76679 San Pedro 20
82832 Springfield 20
32755 Greenville 19
dt[population>100][,.N,by=asciiname][order(-N)][1:20]
asciiname N
1: San Miguel 31
2: San Antonio 31
3: San Francisco 28
4: San Jose 26
5: San Isidro 25
6: Santa Cruz 25
7: Newport 24
8: Buenavista 24
9: Clinton 24
10: San Vicente 23
11: Santa Maria 23
12: Victoria 23
13: Richmond 22
14: Santa Ana 21
15: San Carlos 21
16: Georgetown 21
17: San Pedro 20
18: Franklin 20
19: Springfield 20
20: Windsor 19
Applying value_counts() to a dataframe column returns a series.
df[df.population>100]["asciiname"].value_counts()[:20]
San Antonio 31
San Miguel 31
San Francisco 28
San Jose 26
Santa Cruz 25
San Isidro 25
Newport 24
Clinton 24
Buenavista 24
Santa Maria 23
San Vicente 23
Victoria 23
Richmond 22
San Carlos 21
Santa Ana 21
Georgetown 21
San Pedro 20
Franklin 20
Springfield 20
Greenville 19
Another way is to use the collections.Counter :
import collections
ctr=collections.Counter( df[df.population>100]["asciiname"])
ctr.most_common(20)
[('San Miguel', 31),
('San Antonio', 31),
('San Francisco', 28),
('San Jose', 26),
('Santa Cruz', 25),
('San Isidro', 25),
('Buenavista', 24),
('Newport', 24),
('Clinton', 24),
('Santa Maria', 23),
('San Vicente', 23),
('Victoria', 23),
('Richmond', 22),
('San Carlos', 21),
('Santa Ana', 21),
('Georgetown', 21),
('San Pedro', 20),
('Franklin', 20),
('Springfield', 20),
('Salem', 19)]
postgres
select asciiname,count(1)
from t_city
where population>100
group by asciiname
order by 2 desc
limit 20
asciiname | count
---------------+-------
San Antonio | 31
San Miguel | 31
San Francisco | 28
San Jose | 26
San Isidro | 25
Santa Cruz | 25
Buenavista | 24
Clinton | 24
Newport | 24
San Vicente | 23
Santa Maria | 23
Victoria | 23
Richmond | 22
Santa Ana | 21
Georgetown | 21
San Carlos | 21
Franklin | 20
Springfield | 20
San Pedro | 20
Greenville | 19
(20 rows)
| |