Simple Sales Prediction
 
99_appendix
20160616

Creation of random data

This R script was used to generate the random sales data.


set.seed(29.02) 

# per product multiplication factor
z0=c( 10, 1, 8, 0.3, 0.5, 20)
z=c(z0,z0)

# build a 12x12 matrix: one row per shop/product combination, one column per month
m<-matrix(0,12,12)
for ( i in 1:12) {
    # pick a base level mu for this row, then draw 12 monthly values around it
    mu=3*(8+round(rnorm(1,10,4))) * z[i]
    m[i,]=round(rnorm(12,mu,mu/10),0)
}
df<- cbind( data.frame( cbind( rep(c("superstore", "megamart"),1,each=6),
                               c("bread","cheese","razors","soap","milk","nuts")) ),
            data.frame(m) )

months=c("jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec")
colnames(df)= c("shop","product", months)

# punch some holes! i.e. fill some random positions with NA
num_na=15
na_col=2+sample(12,num_na,replace=T)   # month columns only (columns 3 to 14)
na_row=sample(12,num_na,replace=T)
for (i in 1:num_na ) { df[na_row[i],na_col[i]]=NA }

write.table(df, "sales.csv", row.names=F, col.names=T, sep=",", na="null", quote=T)


The output, in the file "sales.csv", is copy/pasted, massaged a bit, and then fitted into the following Scala script.

Create the sale_df dataframe

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import scala.collection.JavaConversions._   // IMPORTANT: implicitly converts the Scala Seq below into a java.util.Collection for the ArrayList constructor

val sx= new org.apache.spark.sql.SQLContext(sc)

val sale_schema = StructType(
      StructField("shop",StringType,false)  ::
      StructField("product",StringType,false)  ::
      StructField("jan",IntegerType,true)  ::
      StructField("feb",IntegerType,true)  ::
      StructField("mar",IntegerType,true)  ::
      StructField("apr",IntegerType,true)  ::
      StructField("may",IntegerType,true)  ::
      StructField("jun",IntegerType,true)  ::
      StructField("jul",IntegerType,true)  ::
      StructField("aug",IntegerType,true)  ::
      StructField("sep",IntegerType,true)  ::
      StructField("oct",IntegerType,true)  ::
      StructField("nov",IntegerType,true)  ::
      StructField("dec",IntegerType,true) :: Nil
    )

val sale_df=sx.createDataFrame( new java.util.ArrayList[org.apache.spark.sql.Row]( Seq( 
    Row("superstore","bread",341,398,427,344,472,370,354,406,null,407,465,402),
    Row("superstore","cheese",57,52,null,54,62,null,56,66,46,63,55,53),
    Row("superstore","razors",360,362,366,352,365,361,361,353,317,335,290,406),
    Row("superstore","soap",8,8,7,8,6,null,7,7,7,8,6,null),
    Row("superstore","milk",33,null,null,33,30,36,35,34,38,32,35,29),
    Row("superstore","nuts",1338,1369,1157,1305,1532,1231,1466,1148,1298,1059,1216,1231),
    Row("megamart","bread",371,432,425,524,468,414,null,487,493,517,473,470),
    Row("megamart","cheese",51,56,63,null,66,66,50,56,58,null,48,50),
    Row("megamart","razors",599,null,500,423,574,403,609,520,495,577,491,524),
    Row("megamart","soap",null,7,8,9,9,8,9,9,9,6,6,8),
    Row("megamart","milk",null,29,26,30,26,29,29,25,27,null,28,30),
    Row("megamart","nuts",1342,1264,1317,1425,1326,1187,1478,1367,1274,1380,1584,1156))) , sale_schema) 

After running the above Scala script in the Spark shell, you have the same "sale_df" DataFrame that is used in the text above.
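Instead of copy/pasting, the CSV file could also be read directly into a DataFrame. The sketch below is an alternative, not the route taken in this text: it assumes the spark-csv package (com.databricks:spark-csv_2.10:1.4.0) is on the classpath, for example via "spark-shell --packages com.databricks:spark-csv_2.10:1.4.0", and that sales.csv sits in the current directory.

val sale_df_from_csv = sx.read
    .format("com.databricks.spark.csv")   // the spark-csv data source
    .option("header", "true")             // the first line holds the column names
    .option("nullValue", "null")          // the R script wrote missing values as "null"
    .schema(sale_schema)                  // reuse the schema defined above
    .load("sales.csv")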

Sidenote

In some texts you will see DataFrames created without defining a schema, like this:

val sale_df=sx.createDataFrame(Seq(  
("superstore","bread",53,57,58,57,53,57,56,57,53,53,52,62),
("superstore","cheese",60,57,48,56,89,54,66,64,55,61,59,65),
..
..
("superstore","nuts",37,40,39,38,37,41,42,42,39,38,42,46))).toDF(
    "shop","product","jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec")

That is also possible, but it won't work in the above case, because we have NULLs in our data: a Scala Int cannot hold a null value.
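A possible workaround, sketched here for illustration only and not used in this text, is to wrap the monthly figures in Option[Int]: Spark's reflection-based schema inference then marks those columns as nullable, and a None ends up as NULL in the DataFrame. The two shortened rows below are made up.

// sketch: Option[Int] columns become nullable IntegerType columns
val sale_df_opt = sx.createDataFrame(Seq(
    ("superstore", "soap", Option(8), Option(8), None: Option[Int]),
    ("megamart",   "soap", None: Option[Int], Option(7), Option(8))
  )).toDF("shop", "product", "jan", "feb", "mar")

Whether that is nicer than simply writing out the schema, as done above, is a matter of taste.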

The aardvark.code file

The Scala and R code for creating, manipulating and plotting the data, all in one neat aardvark.code file.

aardvark.code

 
Notes by Data Munging Ninja.