|
Aggregate
Simple aggregate: sum the population of the cities for the EU28 countries.
Data entity used: city (see section 2b data load ).
Scala spark
Prep:
val eu28=List("AT", "BE", "BG", "CY", "CZ", "DE", "DK", "EE", "ES", "FI", "FR",
"GB", "GR", "HR", "HU", "IE", "IT", "LT", "LU", "LV", "MT", "NL",
"PL", "PT", "RO", "SE", "SI", "SK", "AN")
Query:
ct.filter( r => (eu28 contains r.country)). // retain cities of the 28 EU countries
map( r => (r.country, r.population)). // create (country, population) tuples
reduceByKey( (_ + _)). // sum up
sortBy(-_._2). // sort by value, negative sign for desc
foreach( println)
(DE,85880759)
(GB,63628475)
(FR,52643435)
(IT,52376411)
(ES,49756525)
(PL,28776423)
(RO,25000006)
(NL,15022475)
(HU,10263483)
(BE,10117760)
(CZ,8720141)
(GR,8562837)
(SE,7803509)
(PT,7097218)
(BG,5457463)
(FI,5178753)
(AT,4924993)
(DK,4475046)
(HR,3744956)
(IE,3548735)
(SK,2971938)
(LT,2755868)
(LV,1735119)
(SI,1183740)
(EE,995124)
(CY,797327)
(MT,398419)
(LU,358224)
Query:
sqlctx.sql("""select country, sum(population) as sum_pop
from city
where country in ('AT','BE','BG','CY','CZ','DE','DK','EE','ES','FI','FR',
'GB','GR','HR','HU','IE','IT','LT','LU','LV','MT','NL',
'PL','PT','RO','SE','SI','SK','AN')
group by country
order by sum_pop desc""").collect().foreach(println)
Result:
[DE,85880759]
[GB,63628475]
[FR,52643435]
[IT,52376411]
[ES,49756525]
[PL,28776423]
[RO,25000006]
[NL,15022475]
[HU,10263483]
[BE,10117760]
[CZ,8720141]
[GR,8562837]
[SE,7803509]
[PT,7097218]
[BG,5457463]
[FI,5178753]
[AT,4924993]
[DK,4475046]
[HR,3744956]
[IE,3548735]
[SK,2971938]
[LT,2755868]
[LV,1735119]
[SI,1183740]
[EE,995124]
[CY,797327]
[MT,398419]
[LU,358224]
The list of countries that make-up the EU28 :
eu28=c("AT", "BE", "BG", "CY", "CZ", "DE", "DK", "EE", "ES", "FI", "FR",
"GB", "GR", "HR", "HU", "IE", "IT", "LT", "LU", "LV", "MT", "NL",
"PL", "PT", "RO", "SE", "SI", "SK", "AN")
Create a subset:
eudf=df[df$country %in% eu28,]
dim(eudf)
57251 8
Aggregate:
aggregate( eudf$population, by=list(country=eudf$country),sum)
country x
1 AT 4924993
2 BE 10117760
3 BG 5457463
4 CY 797327
5 CZ 8720141
6 DE 85880759
7 DK 4475046
8 EE 995124
9 ES 49756525
10 FI 5178753
11 FR 52643435
12 GB 63628475
13 GR 8562837
14 HR 3744956
15 HU 10263483
16 IE 3548735
17 IT 52376411
18 LT 2755868
19 LU 358224
20 LV 1735119
21 MT 398419
22 NL 15022475
23 PL 28776423
24 PT 7097218
25 RO 25000006
26 SE 7803509
27 SI 1183740
28 SK 2971938
Prep:
eu28=c("AT", "BE", "BG", "CY", "CZ", "DE", "DK", "EE", "ES", "FI", "FR",
"GB", "GR", "HR", "HU", "IE", "IT", "LT", "LU", "LV", "MT", "NL",
"PL", "PT", "RO", "SE", "SI", "SK", "AN")
For data.tables both operations (subsetting and aggregation) can be written as a one-liner:
dt[country %in% eu28][,sum(population),by=country]
country V1
1: AT 4924993
2: BE 10117760
3: BG 5457463
4: CY 797327
5: CZ 8720141
6: DE 85880759
7: DK 4475046
8: EE 995124
9: ES 49756525
10: FI 5178753
11: FR 52643435
12: GB 63628475
13: GR 8562837
14: HR 3744956
15: HU 10263483
16: IE 3548735
17: IT 52376411
18: LT 2755868
19: LU 358224
20: LV 1735119
21: MT 398419
22: NL 15022475
23: PL 28776423
24: PT 7097218
25: RO 25000006
26: SE 7803509
27: SI 1183740
28: SK 2971938
Python
Prep:
eu28=set(["AT", "BE", "BG", "CY", "CZ", "DE", "DK", "EE", "ES", "FI", "FR",
"GB", "GR", "HR", "HU", "IE", "IT", "LT", "LU", "LV", "MT", "NL",
"PL", "PT", "RO", "SE", "SI", "SK", "AN"])
Agg:
df[df.country.isin(eu28)][["country","population"]].groupby(['country']).sum()
country population
AT 4924993
BE 10117760
BG 5457463
CY 797327
CZ 8720141
DE 85880759
DK 4475046
EE 995124
ES 49756525
FI 5178753
FR 52643435
GB 50690526
GR 8562837
HR 3744956
HU 10263483
IE 3548735
IT 52376411
LT 2755868
LU 358224
LV 1735119
MT 398419
NL 15022475
PL 28776423
PT 7097218
RO 25000006
SE 7803509
SI 1183740
SK 2971938
Postgres:
select country, sum(population) as sum_pop
from t_city
where country in ('AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR',
'GB', 'GR', 'HR', 'HU', 'IE', 'IT', 'LT', 'LU', 'LV', 'MT', 'NL',
'PL', 'PT', 'RO', 'SE', 'SI', 'SK', 'AN')
group by country
order by 2 desc;
Result:
country | sum_pop
---------+----------
DE | 85880759
GB | 63628475
FR | 52643435
IT | 52376411
ES | 49756525
PL | 28776423
RO | 25000006
NL | 15022475
HU | 10263483
BE | 10117760
CZ | 8720141
GR | 8562837
SE | 7803509
PT | 7097218
BG | 5457463
FI | 5178753
AT | 4924993
DK | 4475046
HR | 3744956
IE | 3548735
SK | 2971938
LT | 2755868
LV | 1735119
SI | 1183740
EE | 995124
CY | 797327
MT | 398419
LU | 358224
(28 rows)
| |