Simple Sales Prediction
 
02_conversion
20160616

From wide to narrow

To make it easier to do the calculations, first we convert the dataframe from wide to narrow, so that we end up with this format:

         shop   product   unit   month   
 ------------ --------- ------ ------- 
   superstore     bread    341       1   
   superstore    cheese     57       1   
   superstore    razors    360       1   
   superstore      soap      8       1   
   superstore      milk     33       1   
   superstore      nuts   1338       1   
     megamart     bread    371       1   
     megamart    cheese     51       1   
     megamart    razors    599       1   
           ..
           ..

The naive, elaborate way of doing the conversion from wide to narrow would be to create 12 dataframes like this, and unionall them together:

     val d1=sale_df.select("shop","product","jan").
                    withColumnRenamed("jan","unit").
                    withColumn("month", lit(1) )

     val d2=sale_df.select("shop","product","feb").
                    withColumnRenamed("feb","unit").
                    withColumn("month", lit(2) )
     ..
     ..
     val d12=sale_df.select("shop","product","dec").
                    withColumnRenamed("dec","unit").
                    withColumn("month", lit(12) )

     df=d1.unionAll(d2).unionAll(d3).unionAll(d4). .. unionAll(d12) 

But there is an easier way: do it programmatically, as shown in the following bit of Scala code. This is a lot more elegant, and saves a lot of typing (and typos) :

    val months=Seq("jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec")

    var i=0
    val df_seq = for (  m <- months ) yield { 
                        i=i+1  
                        sale_df.select("shop","product",m).
                            withColumnRenamed(m,"unit").
                            withColumn("month", lit(i) )
                    }
    val df=df_seq.reduce( _ unionAll _ ) 

The resulting dataframe looks like this:

df.orderBy("shop","product","month").show()

+--------+-------+----+-----+
|    shop|product|unit|month|
+--------+-------+----+-----+
|megamart|  bread| 371|    1|
|megamart|  bread| 432|    2|
|megamart|  bread| 425|    3|
|megamart|  bread| 524|    4|
|megamart|  bread| 468|    5|
|megamart|  bread| 414|    6|
|megamart|  bread|null|    7|
|megamart|  bread| 487|    8|
|megamart|  bread| 493|    9|
|megamart|  bread| 517|   10|
|megamart|  bread| 473|   11|
|megamart|  bread| 470|   12|
|megamart| cheese|  51|    1|
|megamart| cheese|  56|    2|
|megamart| cheese|  63|    3|
|megamart| cheese|null|    4|
..
..

To make it easier on our brain to grok the mathematical formula, let's rename the 'unit' column to 'y' and the 'month' column to 'x', like used in the above formula. At the same time, let's already filter out the records where the unit is null, they are of no use to the calculations.

    val sale_narrow_df=df.where("unit is not null").
                          withColumnRenamed("unit","y").
                          withColumnRenamed("month","x").
                          select("shop","product","x","y") 

Narrow dataframe: sale_narrow_df

Completing the conversion from wide to narrow, we end up with this dataframe, on which all the following calculations will be based.

sale_narrow_df.orderBy("shop","product","x").show()

+--------+-------+---+---+
|    shop|product|  x|  y|
+--------+-------+---+---+
|megamart|  bread|  1|371|
|megamart|  bread|  2|432|
|megamart|  bread|  3|425|
|megamart|  bread|  4|524|
|megamart|  bread|  5|468|
|megamart|  bread|  6|414|
|megamart|  bread|  8|487|
|megamart|  bread|  9|493|
|megamart|  bread| 10|517|
|megamart|  bread| 11|473|
|megamart|  bread| 12|470|
|megamart| cheese|  1| 51|
|megamart| cheese|  2| 56|
|megamart| cheese|  3| 63|
|megamart| cheese|  5| 66|
|megamart| cheese|  6| 66|
..
..
 
Notes by Data Munging Ninja. Generated on nini:/home/willem/sync/20151223_datamungingninja/simplesalesprediction at 2016-06-25 10:02