Simple Sales Prediction
Intro
You have a year's worth of sales data for 2 shops and 6 products. Use simple linear regression to predict the sales for the next month. Tool to use: Apache Spark dataframes.
You receive the data in this 'wide' format, with one column per month. Beware: not all of the cells have data! Spot the nulls.
scala> sale_df.orderBy("shop","product").show()
+----------+-------+----+----+----+----+----+----+----+----+----+----+----+----+
|      shop|product| jan| feb| mar| apr| may| jun| jul| aug| sep| oct| nov| dec|
+----------+-------+----+----+----+----+----+----+----+----+----+----+----+----+
|  megamart|  bread| 371| 432| 425| 524| 468| 414|null| 487| 493| 517| 473| 470|
|  megamart| cheese|  51|  56|  63|null|  66|  66|  50|  56|  58|null|  48|  50|
|  megamart|   milk|null|  29|  26|  30|  26|  29|  29|  25|  27|null|  28|  30|
|  megamart|   nuts|1342|1264|1317|1425|1326|1187|1478|1367|1274|1380|1584|1156|
|  megamart| razors| 599|null| 500| 423| 574| 403| 609| 520| 495| 577| 491| 524|
|  megamart|   soap|null|   7|   8|   9|   9|   8|   9|   9|   9|   6|   6|   8|
|superstore|  bread| 341| 398| 427| 344| 472| 370| 354| 406|null| 407| 465| 402|
|superstore| cheese|  57|  52|null|  54|  62|null|  56|  66|  46|  63|  55|  53|
|superstore|   milk|  33|null|null|  33|  30|  36|  35|  34|  38|  32|  35|  29|
|superstore|   nuts|1338|1369|1157|1305|1532|1231|1466|1148|1298|1059|1216|1231|
|superstore| razors| 360| 362| 366| 352| 365| 361| 361| 353| 317| 335| 290| 406|
|superstore|   soap|   8|   8|   7|   8|   6|null|   7|   7|   7|   8|   6|null|
+----------+-------+----+----+----+----+----+----+----+----+----+----+----+----+
(In the appendix of this article, you'll find the Scala code that creates this dataframe.)
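If you'd rather not spot the nulls by eye, here is a minimal sketch that counts the missing months per row. It assumes you are in the spark-shell with sale_df defined as above; the names months, nullCount and missing_months are made up for this illustration.

```scala
import org.apache.spark.sql.functions.{col, when}

// The twelve month columns of the wide dataframe
val months = Seq("jan", "feb", "mar", "apr", "may", "jun",
                 "jul", "aug", "sep", "oct", "nov", "dec")

// One 0/1 flag per month (1 when the cell is null), summed into a single column expression
val nullCount = months
  .map(m => when(col(m).isNull, 1).otherwise(0))
  .reduce(_ + _)

// Assumption: sale_df is the dataframe shown above
sale_df
  .withColumn("missing_months", nullCount)
  .select("shop", "product", "missing_months")
  .orderBy("shop", "product")
  .show()
```

Rows such as nuts and razors at superstore should come out with 0 missing months, since every cell in those rows is filled in.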
All the data manipulation is done in Spark Dataframes.
These dataframe functions are used:
groupBy(..).agg( sum(..), avg(..) )
withColumn()
withColumnRenamed()
join()
drop()
select()
among others.
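To give a feel for how these functions chain together, here is a small sketch that is not part of the prediction itself: it computes each product's share of its shop's December sales. It assumes the spark-shell with sale_df defined as above; shopDec, withShare and the columns dec_total, dec_avg, dec_sales and dec_share are names invented for this example.

```scala
import org.apache.spark.sql.functions.{avg, col, sum}

// Per-shop total and average of the dec column (sum and avg skip nulls)
val shopDec = sale_df
  .groupBy("shop")
  .agg(sum("dec").as("dec_total"), avg("dec").as("dec_avg"))

// Join the aggregates back onto every row, then derive, rename, drop and select
val withShare = sale_df
  .join(shopDec, Seq("shop"))                              // join on the shared shop column
  .withColumn("dec_share", col("dec") / col("dec_total"))  // null wherever dec is null
  .withColumnRenamed("dec", "dec_sales")
  .drop("dec_avg")                                         // not needed in the result
  .select("shop", "product", "dec_sales", "dec_share")

withShare.orderBy("shop", "product").show()
```

Note that a null in dec yields a null share, which is a reminder that the missing cells have to be dealt with before any regression arithmetic.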
Here's the formula for calculating the coefficients of the simple linear regression, taken from the article Simple Linear Regression: