Simple Sales Prediction

99_appendix
20160616

# Creation of random data

This R script was used to generate the random sales data.

``````
set.seed(29.02)

# per product multiplication factor
z0 = c(10, 1, 8, 0.3, 0.5, 20)
z  = c(z0, z0)

# 12 rows (2 shops x 6 products), 12 month columns
m <- matrix(0, 12, 12)
for (i in 1:12) {
  mu = 3 * (8 + round(rnorm(1, 10, 4))) * z[i]
  m[i,] = matrix(round(rnorm(12, mu, mu/10), 0), 1, 12)
}

# product names: the sixth name is a guess, since only five products
# survive the massaging into the scala script further down
products = c("cheese", "razors", "soap", "milk", "nuts", "eggs")

df <- cbind(data.frame(shop    = rep(c("superstore", "megamart"), each = 6),
                       product = rep(products, 2)),
            data.frame(m))

months = c("jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec")
colnames(df) = c("shop", "product", months)

# punch some holes! -> ie. fill some random positions with NA
num_na = 15
col_idx = 2 + sample(12, num_na, replace = T)   # month columns sit at positions 3..14
row_idx = sample(12, num_na, replace = T)
for (i in 1:num_na) { df[row_idx[i], col_idx[i]] = NA }

write.table(df, "sales.csv", row.names = F, col.names = T, sep = ",", na = "null", quote = T)
``````

The output in file "sales.csv" is copy/pasted, massaged a bit, and then fitted into the following Scala script.

## Create the sale_df dataframe

``````import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import scala.collection.JavaConversions._   // IMPORTANT: provides the implicit Seq -> java.util.Collection conversion used below

val sx= new org.apache.spark.sql.SQLContext(sc)

val sale_schema = StructType(
StructField("shop",StringType,false)  ::
StructField("product",StringType,false)  ::
StructField("jan",IntegerType,true)  ::
StructField("feb",IntegerType,true)  ::
StructField("mar",IntegerType,true)  ::
StructField("apr",IntegerType,true)  ::
StructField("may",IntegerType,true)  ::
StructField("jun",IntegerType,true)  ::
StructField("jul",IntegerType,true)  ::
StructField("aug",IntegerType,true)  ::
StructField("sep",IntegerType,true)  ::
StructField("oct",IntegerType,true)  ::
StructField("nov",IntegerType,true)  ::
StructField("dec",IntegerType,true) :: Nil
)

val sale_df=sx.createDataFrame( new java.util.ArrayList[org.apache.spark.sql.Row]( Seq(
Row("superstore","cheese",57,52,null,54,62,null,56,66,46,63,55,53),
Row("superstore","razors",360,362,366,352,365,361,361,353,317,335,290,406),
Row("superstore","soap",8,8,7,8,6,null,7,7,7,8,6,null),
Row("superstore","milk",33,null,null,33,30,36,35,34,38,32,35,29),
Row("superstore","nuts",1338,1369,1157,1305,1532,1231,1466,1148,1298,1059,1216,1231),
Row("megamart","cheese",51,56,63,null,66,66,50,56,58,null,48,50),
Row("megamart","razors",599,null,500,423,574,403,609,520,495,577,491,524),
Row("megamart","soap",null,7,8,9,9,8,9,9,9,6,6,8),
Row("megamart","milk",null,29,26,30,26,29,29,25,27,null,28,30),
Row("megamart","nuts",1342,1264,1317,1425,1326,1187,1478,1367,1274,1380,1584,1156))) , sale_schema)

``````

After running the above Scala script in the Spark shell you have the same "sale_df" dataframe that is used in the text above.
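Instead of pasting the massaged output into the shell by hand, the CSV file could also be loaded directly. The sketch below assumes the spark-csv package (com.databricks:spark-csv) is on the classpath, and reuses the `sale_schema` defined above; note that `write.table` wrote the NAs as the string "null", which the `nullValue` option maps back to NULL.

```scala
// sketch, assuming the spark-csv package is available --
// the original text pastes the data into the shell instead
val csv_df = sx.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("nullValue", "null").   // "null" strings in the file become NULLs
  schema(sale_schema).
  load("sales.csv")
```

Passing the schema explicitly keeps the month columns as nullable integers instead of relying on type inference.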

## Sidenote

In some texts you see dataframes created without defining a schema, like this:

``````val sale_df=sx.createDataFrame(Seq(
("superstore","cheese",60,57,48,56,89,54,66,64,55,61,59,65),
..
..
("superstore","nuts",37,40,39,38,37,41,42,42,39,38,42,46))).toDF(
"shop","product","jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec")``````

That is also possible, but it won't work in the above case: the schema is inferred from Scala tuples of Int, and a Scala Int cannot hold the NULLs in our data.
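If you do want schema inference together with missing values, wrapping the month values in Option makes the inferred columns nullable, with None playing the role of NULL. A minimal sketch, truncated to three month columns for brevity:

```scala
// assumes the SQLContext sx from the script above;
// Option[Int] is inferred as a nullable integer column
val sale_df2 = sx.createDataFrame(Seq(
  ("superstore", "cheese", Option(57), Option(52),          Option.empty[Int]),
  ("superstore", "soap",   Option(8),  Option.empty[Int],   Option(7))
)).toDF("shop", "product", "jan", "feb", "mar")
```

The trade-off is more verbose literals, which is why the explicit-schema version above is used in the main text.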

# The aardvark.code file

The Scala and R code for creating, manipulating and plotting the data, in one neat aardvark.code file.

aardvark.code

Notes by Data Munging Ninja. Generated on nini:/home/willem/sync/20151223_datamungingninja/simplesalesprediction at 2016-06-25 10:02