Framed Data
 
02_load
20160329

Load the data from a tab-separated text file

  • the dataset cities1000.txt can be downloaded as a zip file from geonames.org.
  • this file contains one line for every city with a population greater than 1000. For more information, see the 'geoname' table stub at the above link
  • fields are separated by tabs
  • some fields will be ignored
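
As a rough illustration of "some fields will be ignored", the sketch below names the 0-based column positions that the parse function in these notes reads. The object name Col and the three-column demo line are hypothetical and only there to show how tab splitting behaves; real lines in cities1000.txt have more columns.

```scala
// Column positions read in these notes (0-based); every other column is ignored.
// The object name Col is an illustrative choice, not part of the original notes.
object Col {
  val GeonameId = 0
  val Name = 1
  val AsciiName = 2
  val Latitude = 4
  val Longitude = 5
  val Country = 8
  val Population = 14
  val Elevation = 16
}

// Hypothetical, shortened line: only the first three columns are present.
val demo = "3039154\tEl Tarter\tEl Tarter"
val cols = demo.split("\t")
println(cols(Col.Name)) // prints "El Tarter"
```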

Start up the Spark shell and load the data file into an RDD[String]:

$ spark-shell

val tx = sc.textFile("file:///home/dmn/city_data/cities1000.txt")

tx.count()
Long = 145725

Define a case class and a parse function:

case class City(
        geonameid: Int, 
        name: String, 
        asciiname: String, 
        latitude: Double, longitude: Double, 
        country: String, 
        population: Int, 
        elevation: Int) 

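// Turn one tab-separated line into a City; columns not listed here are ignored.
def parse(line: String): City = {
  val spl = line.split("\t")
  val geonameid = spl(0).toInt
  val name = spl(1)
  val asciiname = spl(2)
  val latitude = spl(4).toDouble
  val longitude = spl(5).toDouble
  val country = spl(8)
  val population = spl(14).toInt
  // Column 16 is used as the elevation; in the geonames layout this appears
  // to be the digital elevation model value rather than the 'elevation' column.
  val elevation = spl(16).toInt
  City(geonameid, name, asciiname, latitude, longitude, country, population, elevation)
}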
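
The parse function above throws an exception if a line has too few columns or a numeric field cannot be converted. A minimal sketch of a more tolerant variant, assuming the same column positions, is shown below. The name parseSafe is hypothetical, and the City case class is repeated only to keep the block self-contained.

```scala
import scala.util.Try

// Same record layout as the City case class in these notes.
case class City(
  geonameid: Int, name: String, asciiname: String,
  latitude: Double, longitude: Double,
  country: String, population: Int, elevation: Int)

// Hypothetical helper: returns None instead of throwing when a line is too
// short or one of the numeric fields is not a valid number.
def parseSafe(line: String): Option[City] = {
  // A limit of -1 keeps trailing empty fields, so column positions stay stable.
  val spl = line.split("\t", -1)
  Try(City(
    spl(0).toInt, spl(1), spl(2),
    spl(4).toDouble, spl(5).toDouble,
    spl(8), spl(14).toInt, spl(16).toInt)).toOption
}
```

In the Spark shell this could be combined with flatMap, for example tx.flatMap(parseSafe(_)), so that unparseable lines are dropped rather than failing the job. Comparing the resulting count with tx.count() would then show how many lines were skipped.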

Try to parse one line:

parse(tx.take(1)(0))
City = City(3039154,El Tarter,El Tarter,42.57952,1.65362,AD,1052,1721)

Success! Now let's parse the complete text file into City records:

val ct = tx.map(parse(_))

Check:

ct.count
Long = 145725

Spot-check: list all cities above 3500 m with a population of more than 100000, ordered by descending elevation:

val chk = ct.filter(rec => rec.elevation > 3500 && rec.population > 100000).collect()
chk.sortWith((x, y) => x.elevation > y.elevation).foreach(println)

City(3907584,Potosí,Potosi,-19.58361,-65.75306,BO,141251,3967)
City(3909234,Oruro,Oruro,-17.98333,-67.15,BO,208684,3936)
City(3937513,Juliaca,Juliaca,-15.5,-70.13333,PE,245675,3834)
City(3931276,Puno,Puno,-15.8422,-70.0199,PE,116552,3825)
City(3911925,La Paz,La Paz,-16.5,-68.15,BO,812799,3782)
City(1280737,Lhasa,Lhasa,29.65,91.1,CN,118721,3651)
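
The same filter and ordering can also be written as a small function on a plain Scala collection, which makes the logic easy to try out without a Spark context. This is only a sketch: the function name highAndLarge is hypothetical, and the sample holds three rows copied from the output above, deliberately out of order.

```scala
// Same record layout as the City case class in these notes.
case class City(
  geonameid: Int, name: String, asciiname: String,
  latitude: Double, longitude: Double,
  country: String, population: Int, elevation: Int)

// Three rows taken from the spot-check output, shuffled on purpose.
val sample = Seq(
  City(1280737, "Lhasa", "Lhasa", 29.65, 91.1, "CN", 118721, 3651),
  City(3907584, "Potosí", "Potosi", -19.58361, -65.75306, "BO", 141251, 3967),
  City(3911925, "La Paz", "La Paz", -16.5, -68.15, "BO", 812799, 3782)
)

// Hypothetical helper: keep cities above both thresholds, highest first.
def highAndLarge(cities: Seq[City], minElevation: Int, minPopulation: Int): Seq[City] =
  cities
    .filter(c => c.elevation > minElevation && c.population > minPopulation)
    .sortBy(c => -c.elevation) // negate the key to sort in descending order

highAndLarge(sample, 3500, 100000).foreach(println)
```

On the RDD itself, the sort could instead be done before collecting, with ct.filter(...).sortBy(_.elevation, ascending = false), which keeps the ordering work inside Spark.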

That concludes the loading!

 
Notes by Data Munging Ninja. Generated on nini:sync/20151223_datamungingninja/frameddata at 2016-10-18 07:18