Terentia Flores
 
03_query
20151226

Hive queries

Table with all data: t_europe

For this we assume that the ingest has taken place (details: see next section), and the CSV files have been put under directory '/user/dmn/europe' on HDFS.

First create the table, linked to the CSV files on HDFS:

create external table t_europe 
(
    lat float,
    lon float,
    name string
)
row format delimited
fields terminated by '\t'
stored as textfile
location '/user/dmn/europe'
;

First query: how many waypoints do we have?

select count(1) from t_europe;
..
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 19  Reduce: 1   Cumulative CPU: 171.6 sec   ..
Total MapReduce CPU Time Spent: 2 minutes 51 seconds 600 msec
..
64759895

The 'rough distance' function

For finding out the distance between two waypoints, there is no standard function in Hive, so we write a User Defined Function in Java, compile it, and put it in a jar named 'udf.jar' (more details: see section 'hive_udf'). Then we incorporate it into Hive as follows:

ADD JAR udf.jar;
CREATE TEMPORARY FUNCTION UDF_ROUGH_DISTANCE as 'UdfRoughDistance';

Now we can make use of the function as shown in this query that answers how many waypoints we have in a 10km radius around the Palantine hill (lat,lon= 41.8898803,12.4849976) in Rome:

select count(1) from t_europe
where lat is not null
and lon is not null
and udf_rough_distance(41.8898803,12.4849976, lat,lon) < 10
;

Answer:

37189

Table with subset of data: t_palantine

Let's put the waypoints close to the Palantine hill in a separate table. We also want to have a column 'distance' being the distance between the Palantine hill and all the points. At the same time we also want to filter by 'distance' :

create table t_palantine as
select lat,
    lon,
    udf_rough_distance(41.8898803,12.4849976, lat,lon) as dist_km,
    name
from t_europe
where lat is not null
and   lon is not null
and  udf_rough_distance(41.8898803,12.4849976, lat,lon)<10.0
;

This takes a considerable amount of time: 7 minutes 17 seconds 410 msec. Maybe because the distance has to be calculated twice? Let's use a CTE (a common table expression) in our SQL, to avoid that.

drop table t_palantine;

create table t_palantine as
with ct as (
    select lat,lon,
        udf_rough_distance(41.8898803,12.4849976, lat,lon) as dist_km,
        name
    from t_europe
    where lat is not null
    and   lon is not null
) 
select lat,lon,dist_km,name 
from ct
where dist_km<10
;

This only takes: 73.209 seconds. Big improvement!

Flowers for Terentia

Let's now answer the question: which flower shops are near Cicero's on the Palantine hill?

select round(dist_km,2) as dist_km, name 
from t_palantine 
where lower(name) like '%flor%'
order by dist_km asc
limit 10
;

The result:

0.88    name:Pizzeria Florida shop:bakery phone:+39 06 68803236 operator:Fiori ..
1.19    name:Flor gelato italiano amenity:ice_cream website:www.gelatiflor.it a..
2.16    name:Hotel Floridia tourism:hotel
2.16    website:http://www.hotelfloraroma.com/ operator:Marriott addr:street:Vi..
2.17    name:florian's amenity:restaurant
2.29    name:Vivaio piante shop:florist
3.21    shop:florist
3.3     shop:florist
3.41    shop:florist
3.44    shop:florist
Time taken: 24.486 seconds, Fetched: 10 row(s)

His best bet would be 'Vivaio Piante' at 2.29 km:

(picture belongs to Google maps)

 
Notes by Data Munging Ninja. Generated on nini:sync/20151223_datamungingninja/terentiaflores at 2016-10-18 07:18