|
From wide to narrow
To make it easier to do the calculations, first we convert the dataframe from wide to narrow, so that we end up with this format:
shop product unit month
------------ --------- ------ -------
superstore bread 341 1
superstore cheese 57 1
superstore razors 360 1
superstore soap 8 1
superstore milk 33 1
superstore nuts 1338 1
megamart bread 371 1
megamart cheese 51 1
megamart razors 599 1
..
..
The naive, elaborate way of doing the conversion from wide to narrow would be to create 12 dataframes like this, and unionall them together:
val d1=sale_df.select("shop","product","jan").
withColumnRenamed("jan","unit").
withColumn("month", lit(1) )
val d2=sale_df.select("shop","product","feb").
withColumnRenamed("feb","unit").
withColumn("month", lit(2) )
..
..
val d12=sale_df.select("shop","product","dec").
withColumnRenamed("dec","unit").
withColumn("month", lit(12) )
df=d1.unionAll(d2).unionAll(d3).unionAll(d4). .. unionAll(d12)
But there is an easier way: do it programmatically, as shown in the following bit of Scala code. This is a lot more elegant, and saves a lot of typing (and typos) :
val months=Seq("jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec")
var i=0
val df_seq = for ( m <- months ) yield {
i=i+1
sale_df.select("shop","product",m).
withColumnRenamed(m,"unit").
withColumn("month", lit(i) )
}
val df=df_seq.reduce( _ unionAll _ )
The resulting dataframe looks like this:
df.orderBy("shop","product","month").show()
+--------+-------+----+-----+
| shop|product|unit|month|
+--------+-------+----+-----+
|megamart| bread| 371| 1|
|megamart| bread| 432| 2|
|megamart| bread| 425| 3|
|megamart| bread| 524| 4|
|megamart| bread| 468| 5|
|megamart| bread| 414| 6|
|megamart| bread|null| 7|
|megamart| bread| 487| 8|
|megamart| bread| 493| 9|
|megamart| bread| 517| 10|
|megamart| bread| 473| 11|
|megamart| bread| 470| 12|
|megamart| cheese| 51| 1|
|megamart| cheese| 56| 2|
|megamart| cheese| 63| 3|
|megamart| cheese|null| 4|
..
..
To make it easier on our brain to grok the mathematical formula, let's rename the 'unit' column to 'y' and the 'month' column to 'x', like used in the above formula. At the same time, let's already filter out the records where the unit is null, they are of no use to the calculations.
val sale_narrow_df=df.where("unit is not null").
withColumnRenamed("unit","y").
withColumnRenamed("month","x").
select("shop","product","x","y")
Narrow dataframe: sale_narrow_df
Completing the conversion from wide to narrow, we end up with this dataframe, on which all the following calculations will be based.
sale_narrow_df.orderBy("shop","product","x").show()
+--------+-------+---+---+
| shop|product| x| y|
+--------+-------+---+---+
|megamart| bread| 1|371|
|megamart| bread| 2|432|
|megamart| bread| 3|425|
|megamart| bread| 4|524|
|megamart| bread| 5|468|
|megamart| bread| 6|414|
|megamart| bread| 8|487|
|megamart| bread| 9|493|
|megamart| bread| 10|517|
|megamart| bread| 11|473|
|megamart| bread| 12|470|
|megamart| cheese| 1| 51|
|megamart| cheese| 2| 56|
|megamart| cheese| 3| 63|
|megamart| cheese| 5| 66|
|megamart| cheese| 6| 66|
..
..
| |