Intro
You have a development Spark cluster, running on 4 Xen virtual machines (named nk01, nk02, ...) that all live on the same dom0 host (nk00).
You want to write a Scala Spark job, submit it to the cluster, and monitor the load to see that all 4 nodes are pulling their weight!
This article shows:
- how to package a Scala Spark job using sbt and submit it to the cluster
- how to monitor the load on the nodes of your Xen virtual systems, using a 'shoestring-budget' monitoring method
- how to use R for visualization
Software used:
- spark-1.6.0 for hadoop
- hadoop-2.7.1 (hdfs/yarn)
- R version 3.2.4
- xentop on dom0.
Data: NYC taxi rides of 2015
The Scala job in question is going to parse the New York City taxi data of 2015 and tally up the following:
- how many rides
- how many miles
- how many passengers
Go ahead and download the yellow cab trip sheet data from www.nyc.gov/html/tlc/html/about/trip_record_data.shtml and put it on your HDFS.
Before you blow up your data pipeline, a little warning about size: every file is between 1.7 GB and 2.0 GB, which brings the total to roughly 22 GB.
For a detailed description of the data fields, see: www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
A quick cut from the July file:
VID pickup_datetime dropoff_datetime #psngr dist pickup_longitude pickup_latitude ..
1 2015-07-01 00:00:00 2015-07-01 00:15:26 1 3.50 -73.994155883789063 40.751125335693359
1 2015-07-01 00:00:00 2015-07-01 00:22:22 1 3.90 -73.984657287597656 40.768486022949219
1 2015-07-01 00:00:00 2015-07-01 00:07:42 1 2.30 -73.978889465332031 40.762287139892578
1 2015-07-01 00:00:00 2015-07-01 00:39:37 1 9.20 -73.992790222167969 40.742759704589844
1 2015-07-01 00:00:00 2015-07-01 00:05:34 1 1.10 -73.912429809570313 40.769809722900391
1 2015-07-01 00:00:00 2015-07-01 00:06:46 2 1.00 -73.959159851074219 40.773429870605469
2 2015-07-01 00:00:00 2015-07-01 00:36:57 2 19.12 -73.789459228515625 40.647258758544922
2 2015-07-01 00:00:00 2015-07-01 06:30:15 1 .00 0 0
2 2015-07-01 00:00:00 2015-07-01 11:27:07 1 2.58 -73.998931884765625 40.744678497314453
2 2015-07-01 00:00:00 2015-07-01 00:00:00 1 1.07 -73.99383544921875 40.735431671142578
We are interested in fields:
- passenger_count (#psngr)
- trip_distance (dist)
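Before writing the full job, it doesn't hurt to peek at a few raw lines from the Spark shell and verify that passenger_count and trip_distance really sit at (zero-based) positions 3 and 4. A minimal sketch; the path and file name below are just an example, point it at whatever you put on HDFS:
// print a few raw lines to verify the field layout
// (example path/file name; adapt to your own HDFS location)
val sample = sc.textFile("hdfs:///user/dmn/20160421_nyc_taxi/yellow_tripdata_2015-07.csv")
sample.take(3).foreach(println)
// list each header field with its zero-based index
sample.first().split(",").zipWithIndex.foreach { case (name, i) => println(s"$i: $name") }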
If you were only going to run the Scala code in the Spark shell, this would suffice:
// load the data
val taxi_file = sc.textFile("path-to-your-data-files")

// for every line in the file (except the header), split it into fields,
// and 'emit' a tuple containing (1, distance, num_passengers):
val ride = taxi_file.filter( !_.startsWith("VendorID") ).
  map( line => {
    val spl = line.split(",")
    // 1, meter_miles, num_passengers
    ( 1, spl(4).toDouble, spl(3).toInt )
  })

// sum up
val tuple = ride.reduce( (a,b) => (a._1+b._1, a._2+b._2, a._3+b._3) )
println(s"Totals: ${tuple}")
Output:
(146112989,1.9195264796499913E9,245566747)
Which is 146 million taxi-rides, covering 2 billion miles, carrying 245 million passengers.
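Aside: the raw files do contain the occasional odd or empty line. If the parse ever dies on a NumberFormatException, a slightly more defensive version of the map step (just a sketch, not what was run here) silently drops the lines that don't parse:
import scala.util.Try

// defensive variant: drop the header and any line whose fields don't parse
val ride = taxi_file.filter( !_.startsWith("VendorID") ).
  flatMap( line => {
    val spl = line.split(",")
    Try( ( 1, spl(4).toDouble, spl(3).toInt ) ).toOption
  })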
To be submittable as a job on the cluster, the code needs to be encapsulated as follows:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.log4j.Logger
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object Taxi {
  def main(arg: Array[String]) {
    val logger = Logger.getLogger(this.getClass())

    // Arguments
    if (arg.length < 1) {
      logger.error("No input path!")
      System.err.println("No input path!")
      System.exit(1)
    }
    val inpath = arg(0)

    // setup sparkcontext
    val jobname = "Taxi"
    val conf = new SparkConf().setAppName(jobname)
    val sc = new SparkContext(conf)
    logger.info(s"Job: ${jobname} Path: ${inpath}")

    // the query
    val taxi_file = sc.textFile(inpath)
    val ride = taxi_file.filter( !_.startsWith("VendorID") ).
      map( line => {
        val spl = line.split(",")
        // 1, meter_miles, num_passengers
        ( 1, spl(4).toDouble, spl(3).toInt )
      })
    val tuple = ride.reduce( (a,b) => (a._1+b._1, a._2+b._2, a._3+b._3) )
    println(s"Totals: ${tuple}")
  }
}
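One caveat with running this in yarn-cluster mode (as we do below): the driver runs inside a YARN container, so the println output ends up in that container's log, not on your terminal. If you would rather have the totals land on HDFS, main() could end with something like the following sketch (the output path is just an example, and it must not exist yet):
// write the totals to HDFS instead of relying on the driver log
val outpath = "hdfs:///user/dmn/taxi_totals"   // example path
sc.parallelize(Seq(tuple.toString), 1).saveAsTextFile(outpath)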
SBT
We're building with sbt (the Scala build tool). Download and install it on any of your systems; it will pull in all the necessary dependencies.
The code file and the skeleton files for the sbt build can be found in this zip: sbt_taxi.zip
Files:
Taxi.scala
build.sbt
project/assembly.sbt
File: Taxi.scala
The Scala query code; see the listing above.
File: build.sbt
Just plain sbt stuff:
mainClass in assembly := Some("Taxi")
jarName in assembly := "taxi.jar"
lazy val root = (project in file(".")).
settings(
name := "taxi",
version := "1.0"
)
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
)
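One optional addition: the pre-built Spark 1.6.0 binaries are compiled against Scala 2.10, so if sbt happens to pick a different Scala version on your machine you can pin it (this also matches the target/scala-2.10 path you will see further down):
// optional: build against Scala 2.10, the version used by the pre-built Spark 1.6.0
scalaVersion := "2.10.6"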
File: project/assembly.sbt
Only contains the link to the assembly plugin for SBT. Aim: build a fat jar with all of the dependencies.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
For more info: github.com/sbt/sbt-assembly
Compile
In the root directory of your project, run "sbt assembly" and go for coffee: the very first time this takes quite a while and requires quite a few downloads. If an error occurs, first try rerunning "sbt assembly"; it may or may not help.
Like this:
$ unzip ~/Downloads/sbt_taxi.zip
Archive: sbt_taxi.zip
inflating: Taxi.scala
inflating: project/assembly.sbt
inflating: build.sbt
$ sbt assembly
[info] Loading project definition from /home/wildadm/20160428_scala_sbt3/project
[info] Updating {file:/home/wildadm/20160428_scala_sbt3/project/}root-20160428_scala_sbt3-build...
[info] Resolving org.pantsbuild#jarjar;1.6.0 ...
..
(first time? wait a long while)
..
If all goes well you end up with this beauty:
target/scala-2.10/taxi.jar
Troubleshooting
When trying to build on one system I kept getting build errors (due to duplicate classes), until I added this section to the build file. I removed it again afterwards.
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
    case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
    case PathList("org", "apache", xs @ _*) => MergeStrategy.last
    case PathList("com", "google", xs @ _*) => MergeStrategy.last
    case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
    case PathList("com", "codahale", xs @ _*) => MergeStrategy.last
    case PathList("com", "yammer", xs @ _*) => MergeStrategy.last
    case "about.html" => MergeStrategy.rename
    case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
    case "META-INF/mailcap" => MergeStrategy.last
    case "META-INF/mimetypes.default" => MergeStrategy.last
    case "plugin.properties" => MergeStrategy.last
    case "log4j.properties" => MergeStrategy.last
    case x => old(x)
  }
}
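If sbt warns that the <<= syntax is deprecated, sbt-assembly 0.14.x also accepts a newer-style key; something along these lines (shortened here to two of the cases above), roughly following the sbt-assembly README:
assemblyMergeStrategy in assembly := {
  case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
  case "log4j.properties" => MergeStrategy.last
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}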
Transfer the taxi.jar resulting from "sbt assembly" to the server hosting Spark.
Launch
Here you need to do two things, nearly at the same time:
- kick off your monitoring-data collection process (on nk00, the Xen dom0)
- submit the job on the Spark cluster (on nk01, the Xen VM running Spark)
Detail
Before you submit the Spark job, you want to kick off your virtual-server monitoring.
1. Collect monitoring data
Log on to your domain zero (dom0) host, i.e. the hypervisor that hosts your Xen virtual machines.
If you haven't got it yet, install xentop via your distribution's package manager (it ships with the Xen tools).
Then run it as follows (more details in the next section):
sudo xentop -b -d 1 -i 500 > xt.log
2. Submit the Spark job
Meanwhile, on your Spark system, submit the taxi job to the cluster as follows:
$SPARK_HOME/bin/spark-submit --master yarn-cluster --num-executors 12 \
taxi.jar hdfs:///user/dmn/20160421_nyc_taxi
For the above you need your freshly created taxi.jar and the location of the NYC taxi-ride csv files on your HDFS cluster.
Check how your job is faring via Hadoop's web interface. In this case that is on node 1: http://nk01:8088/cluster/apps
Plot the load of your Xen cluster
Step 1: data gathering
In the prior section you were told to run xentop on dom0 as follows:
sudo xentop -b -d 1 -i 500 > xt.log
Note: 500 is the number of iterations; combined with the one-second delay (-d 1), that is roughly 500 seconds of data collection. You may need to increase or decrease this value depending on how long your job runs.
Once the job has run, filter the data (note: this greps on 'nk', the common prefix of the node names):
cat xt.log | cut -c-37 | grep nk > xt_data.txt
This gives the following text file:
nk01 --b--- 111312 0.0
nk02 --b--- 30264 0.0
nk03 --b--- 30472 0.0
nk04 --b--- 30425 0.0
nk01 --b--- 111312 5.7
nk02 --b--- 30264 2.1
Step 2: load data into R
Start up R and load the data:
nw=read.table("xt_data.txt",header=F)
cluster=data.frame( nk01=nw[nw$V1=='nk01','V4']
, nk02=nw[nw$V1=='nk02','V4']
, nk03=nw[nw$V1=='nk03','V4']
, nk04=nw[nw$V1=='nk04','V4']
)
Admittedly, there should be a better way to translate from the narrow to the wide format, but since we only have a few nodes...
head(cluster)
nk01 nk02 nk03 nk04
1 0.0 0.0 0.0 0.0
2 5.7 2.1 2.2 2.4
3 5.5 2.2 2.1 2.1
4 61.6 2.1 2.1 2.0
5 165.7 2.3 2.3 2.1
6 167.2 2.1 2.2 2.2
Plot:
par(mar=c(2, 4, 1, 1) ) # bottom, left, top, right
par(mfrow=c(4,1) )
lim=c(0,400)
for ( node in c('nk01','nk02','nk03','nk04') ) {
plot( nw[nw$V1==node,'V4'], col="blue",type="l", ylim=lim, xlab='', ylab=node)
}
Result
Admire the plot in the next section.
Plot
[Figure: CPU load (%) over time for nk01 through nk04, one panel per node]
Conclusion
As you can tell from the chart, after a short burst on the namenode (nk01) the load gets distributed equally over the 4 nodes. In other words: Spark works as advertised on the box!
And we found out that the yellow cabs in New York City clocked about 2 billion miles in 2015, carrying 245 million passengers over 146 million taxi rides.