Framed Data
 
03_aggregate
20160601

Aggregate

Simple aggregate: sum the population of the cities for the EU28 countries.

Data entity used: city (see section 2b data load ).

Scala spark

Prep:

val eu28=List("AT", "BE", "BG", "CY", "CZ", "DE", "DK", "EE", "ES", "FI", "FR",
              "GB", "GR", "HR", "HU", "IE", "IT", "LT", "LU", "LV", "MT", "NL",
              "PL", "PT", "RO", "SE", "SI", "SK", "AN")

Query:

ct.filter( r => (eu28 contains r.country)).  // retain cities of the 28 EU countries
   map( r => (r.country, r.population)).     // create (country, population) tuples
   reduceByKey( (_ + _)).                    // sum up
   sortBy(-_._2).                            // sort by value, negative sign for desc
   foreach( println)

(DE,85880759)
(GB,63628475)
(FR,52643435)
(IT,52376411)
(ES,49756525)
(PL,28776423)
(RO,25000006)
(NL,15022475)
(HU,10263483)
(BE,10117760)
(CZ,8720141)
(GR,8562837)
(SE,7803509)
(PT,7097218)
(BG,5457463)
(FI,5178753)
(AT,4924993)
(DK,4475046)
(HR,3744956)
(IE,3548735)
(SK,2971938)
(LT,2755868)
(LV,1735119)
(SI,1183740)
(EE,995124)
(CY,797327)
(MT,398419)
(LU,358224)

Query:

sqlctx.sql("""select country, sum(population) as sum_pop 
              from city  
              where country in ('AT','BE','BG','CY','CZ','DE','DK','EE','ES','FI','FR',
                         'GB','GR','HR','HU','IE','IT','LT','LU','LV','MT','NL',
                         'PL','PT','RO','SE','SI','SK','AN')
              group by country
              order by sum_pop desc""").collect().foreach(println)

Result:

[DE,85880759]
[GB,63628475]
[FR,52643435]
[IT,52376411]
[ES,49756525]
[PL,28776423]
[RO,25000006]
[NL,15022475]
[HU,10263483]
[BE,10117760]
[CZ,8720141]
[GR,8562837]
[SE,7803509]
[PT,7097218]
[BG,5457463]
[FI,5178753]
[AT,4924993]
[DK,4475046]
[HR,3744956]
[IE,3548735]
[SK,2971938]
[LT,2755868]
[LV,1735119]
[SI,1183740]
[EE,995124]
[CY,797327]
[MT,398419]
[LU,358224]

The list of countries that make-up the EU28 :

eu28=c("AT", "BE", "BG", "CY", "CZ", "DE", "DK", "EE", "ES", "FI", "FR",
       "GB", "GR", "HR", "HU", "IE", "IT", "LT", "LU", "LV", "MT", "NL",
       "PL", "PT", "RO", "SE", "SI", "SK", "AN")

Create a subset:

eudf=df[df$country %in% eu28,]

dim(eudf)
57251     8

Aggregate:

aggregate( eudf$population, by=list(country=eudf$country),sum)  

   country        x
1       AT  4924993
2       BE 10117760
3       BG  5457463
4       CY   797327
5       CZ  8720141
6       DE 85880759
7       DK  4475046
8       EE   995124
9       ES 49756525
10      FI  5178753
11      FR 52643435
12      GB 63628475
13      GR  8562837
14      HR  3744956
15      HU 10263483
16      IE  3548735
17      IT 52376411
18      LT  2755868
19      LU   358224
20      LV  1735119
21      MT   398419
22      NL 15022475
23      PL 28776423
24      PT  7097218
25      RO 25000006
26      SE  7803509
27      SI  1183740
28      SK  2971938

Prep:

eu28=c("AT", "BE", "BG", "CY", "CZ", "DE", "DK", "EE", "ES", "FI", "FR",
       "GB", "GR", "HR", "HU", "IE", "IT", "LT", "LU", "LV", "MT", "NL",
       "PL", "PT", "RO", "SE", "SI", "SK", "AN")

For data.tables both operations (subsetting and aggregation) can be written as a one-liner:

 dt[country %in% eu28][,sum(population),by=country]

    country       V1
 1:      AT  4924993
 2:      BE 10117760
 3:      BG  5457463
 4:      CY   797327
 5:      CZ  8720141
 6:      DE 85880759
 7:      DK  4475046
 8:      EE   995124
 9:      ES 49756525
10:      FI  5178753
11:      FR 52643435
12:      GB 63628475
13:      GR  8562837
14:      HR  3744956
15:      HU 10263483
16:      IE  3548735
17:      IT 52376411
18:      LT  2755868
19:      LU   358224
20:      LV  1735119
21:      MT   398419
22:      NL 15022475
23:      PL 28776423
24:      PT  7097218
25:      RO 25000006
26:      SE  7803509
27:      SI  1183740
28:      SK  2971938

Python

Prep:

eu28=set(["AT", "BE", "BG", "CY", "CZ", "DE", "DK", "EE", "ES", "FI", "FR",
          "GB", "GR", "HR", "HU", "IE", "IT", "LT", "LU", "LV", "MT", "NL",
          "PL", "PT", "RO", "SE", "SI", "SK", "AN"])

Agg:

df[df.country.isin(eu28)][["country","population"]].groupby(['country']).sum()

country  population
AT          4924993
BE         10117760
BG          5457463
CY           797327
CZ          8720141
DE         85880759
DK          4475046
EE           995124
ES         49756525
FI          5178753
FR         52643435
GB         50690526
GR          8562837
HR          3744956
HU         10263483
IE          3548735
IT         52376411
LT          2755868
LU           358224
LV          1735119
MT           398419
NL         15022475
PL         28776423
PT          7097218
RO         25000006
SE          7803509
SI          1183740
SK          2971938

Postgres:

select country, sum(population) as sum_pop
from t_city 
where country in ('AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'ES', 'FI', 'FR',
         'GB', 'GR', 'HR', 'HU', 'IE', 'IT', 'LT', 'LU', 'LV', 'MT', 'NL',
         'PL', 'PT', 'RO', 'SE', 'SI', 'SK', 'AN')
group by country
order by 2 desc;

Result:

 country | sum_pop  
---------+----------
 DE      | 85880759
 GB      | 63628475
 FR      | 52643435
 IT      | 52376411
 ES      | 49756525
 PL      | 28776423
 RO      | 25000006
 NL      | 15022475
 HU      | 10263483
 BE      | 10117760
 CZ      |  8720141
 GR      |  8562837
 SE      |  7803509
 PT      |  7097218
 BG      |  5457463
 FI      |  5178753
 AT      |  4924993
 DK      |  4475046
 HR      |  3744956
 IE      |  3548735
 SK      |  2971938
 LT      |  2755868
 LV      |  1735119
 SI      |  1183740
 EE      |   995124
 CY      |   797327
 MT      |   398419
 LU      |   358224
(28 rows)
 
Notes by Data Munging Ninja. Generated on nini:sync/20151223_datamungingninja/frameddata at 2016-10-18 07:18