Framed Data
 
02a_data_create
20160601

Create dataframe from arrays

Heres' how to create dataframes udf and tdf (aka tables t_user and t_transaction) from simple arrays.

Create dataframe udf

Create a dataframe from arrays:

id_v=c(9000, 9001, 9002, 9003, 9004, 9005, 9006, 9007, 9008)
name_v=c('Gerd Abrahamsson', 'Hanna Andersson', 'August Bergsten',
      'Arvid Bohlin', 'Edvard Marklund', 'Ragnhild Brännström',
      'Börje Wallin', 'Otto Byström','Elise Dahlström')

udf=data.frame( uid=id_v, name=name_v)

Content of udf:

   uid                name
1 9000    Gerd Abrahamsson
2 9001     Hanna Andersson
3 9002     August Bergsten
4 9003        Arvid Bohlin
5 9004     Edvard Marklund
6 9005 Ragnhild Brännström
7 9006        Börje Wallin
8 9007        Otto Byström
9 9008     Elise Dahlström

Create dataframe tdf

xid_v <- c( 5000, 5001, 5002, 5003, 5004, 5005, 5006, 5007, 5008, 5009, 5010, 5011, 5012,
            5013, 5014, 5015, 5016, 5017, 5018, 5019, 5020)

uid_v <- c( 9008, 9003, 9003, 9007, 9004, 9007, 9002, 9008, 9005, 9008, 9006, 9008, 9008,
            9005, 9005, 9001, 9000, 9003, 9002, 9001, 9004)

amount_v <- c(498, 268, 621, -401, 720, -492, -153, 272, -250, 82, 549, -571, 814, -114,
              819, -404, -95, 428, -549, -462, -339)

date_v <- c('2016-02-21 06:28:49', '2016-01-17 13:37:38', '2016-02-24 15:36:53',
            '2016-01-14 16:43:27', '2016-05-14 16:29:54', '2016-02-24 23:58:57',
            '2016-02-18 17:58:33', '2016-05-26 12:00:00', '2016-02-24 23:14:52',
            '2016-04-20 18:33:25', '2016-02-16 14:37:25', '2016-02-28 13:05:33',
            '2016-03-20 13:29:11', '2016-02-06 14:55:10', '2016-01-18 10:50:20',
            '2016-02-20 22:08:23', '2016-05-09 10:26:05', '2016-03-27 15:30:47',
            '2016-04-15 21:44:49', '2016-03-09 20:32:35', '2016-05-03 17:11:21')

tdf<-data.frame( xid=xid_v, uid=uid_v, amount=amount_v, date=as.POSIXct(date_v) )

Content of tdf:

    xid  uid amount                date
1  5000 9008    498 2016-02-21 06:28:49
2  5001 9003    268 2016-01-17 13:37:38
3  5002 9003    621 2016-02-24 15:36:53
4  5003 9007   -401 2016-01-14 16:43:27
5  5004 9004    720 2016-05-14 16:29:54
6  5005 9007   -492 2016-02-24 23:58:57
7  5006 9002   -153 2016-02-18 17:58:33
8  5007 9008    272 2016-05-26 12:00:00
9  5008 9005   -250 2016-02-24 23:14:52
10 5009 9008     82 2016-04-20 18:33:25
11 5010 9006    549 2016-02-16 14:37:25
12 5011 9008   -571 2016-02-28 13:05:33
13 5012 9008    814 2016-03-20 13:29:11
14 5013 9005   -114 2016-02-06 14:55:10
15 5014 9005    819 2016-01-18 10:50:20
16 5015 9001   -404 2016-02-20 22:08:23
17 5016 9000    -95 2016-05-09 10:26:05
18 5017 9003    428 2016-03-27 15:30:47
19 5018 9002   -549 2016-04-15 21:44:49
20 5019 9001   -462 2016-03-09 20:32:35
21 5020 9004   -339 2016-05-03 17:11:21

Create dataframe udf

Create a dataframe from arrays:

import pandas as pd

uid_v= [9000, 9001, 9002, 9003, 9004, 9005, 9006, 9007, 9008]
name_v=[u'Gerd Abrahamsson', u'Hanna Andersson', u'August Bergsten',
      u'Arvid Bohlin', u'Edvard Marklund', u'Ragnhild Br\xe4nnstr\xf6m',
      u'B\xf6rje Wallin', u'Otto Bystr\xf6m',u'Elise Dahlstr\xf6m']


udf=pd.DataFrame(uid_v, columns=['uid'])
udf['name']=name_v

Content of udf:

    uid                 name
0  9000     Gerd Abrahamsson
1  9001      Hanna Andersson
2  9002      August Bergsten
3  9003         Arvid Bohlin
4  9004      Edvard Marklund
5  9005  Ragnhild Brännström
6  9006         Börje Wallin
7  9007         Otto Byström
8  9008      Elise Dahlström

Create dataframe tdf

xid_v=[ 5000, 5001, 5002, 5003, 5004, 5005, 5006, 5007, 5008, 5009, 5010, 5011, 5012,
        5013, 5014, 5015, 5016, 5017, 5018, 5019, 5020]

uid_v=[ 9008, 9003, 9003, 9007, 9004, 9007, 9002, 9008, 9005, 9008, 9006, 9008, 9008,
        9005, 9005, 9001, 9000, 9003, 9002, 9001, 9004]

amount_v= [498, 268, 621, -401, 720, -492, -153, 272, -250, 82, 549, -571, 814, -114,
        819, -404, -95, 428, -549, -462, -339]

date_v=['2016-02-21 06:28:49', '2016-01-17 13:37:38', '2016-02-24 15:36:53',
        '2016-01-14 16:43:27', '2016-05-14 16:29:54', '2016-02-24 23:58:57',
        '2016-02-18 17:58:33', '2016-05-26 12:00:00', '2016-02-24 23:14:52',
        '2016-04-20 18:33:25', '2016-02-16 14:37:25', '2016-02-28 13:05:33',
        '2016-03-20 13:29:11', '2016-02-06 14:55:10', '2016-01-18 10:50:20',
        '2016-02-20 22:08:23', '2016-05-09 10:26:05', '2016-03-27 15:30:47',
        '2016-04-15 21:44:49', '2016-03-09 20:32:35', '2016-05-03 17:11:21']

tdf=pd.DataFrame(xid_v, columns=['xid'])
tdf['uid']=uid_v
tdf['amount']=amount_v
tdf['date']=pd.to_datetime(date_v)

Content of tdf:

     xid   uid  amount                date
0   5000  9008     498 2016-02-21 06:28:49
1   5001  9003     268 2016-01-17 13:37:38
2   5002  9003     621 2016-02-24 15:36:53
3   5003  9007    -401 2016-01-14 16:43:27
4   5004  9004     720 2016-05-14 16:29:54
5   5005  9007    -492 2016-02-24 23:58:57
6   5006  9002    -153 2016-02-18 17:58:33
7   5007  9008     272 2016-05-26 12:00:00
8   5008  9005    -250 2016-02-24 23:14:52
9   5009  9008      82 2016-04-20 18:33:25
10  5010  9006     549 2016-02-16 14:37:25
11  5011  9008    -571 2016-02-28 13:05:33
12  5012  9008     814 2016-03-20 13:29:11
13  5013  9005    -114 2016-02-06 14:55:10
14  5014  9005     819 2016-01-18 10:50:20
15  5015  9001    -404 2016-02-20 22:08:23
16  5016  9000     -95 2016-05-09 10:26:05
17  5017  9003     428 2016-03-27 15:30:47
18  5018  9002    -549 2016-04-15 21:44:49
19  5019  9001    -462 2016-03-09 20:32:35
20  5020  9004    -339 2016-05-03 17:11:21

Example: select data from a certain date range

from datetime import datetime
startdate=datetime.strptime('2016-02-01','%Y-%m-%d')
enddate=datetime.strptime('2016-02-29','%Y-%m-%d')
tdf[((tdf.txdate >= startdate) & (tdf.txdate <=enddate))]

The easiest way to generate the sql statements is to use python pandas to connect and export the dataframes to a sqlite database, like this:

import sqlite3
con=sqlite3.connect('db.sqlite') 
udf.to_sql(name='t_user', con=con, index=False)
tdf.to_sql(name='t_transaction', con=con, index=False)
con.close()

Then issue this command on the CLI :

sqlite3 db.sqlite .dump > create.sql 

.. to give you these SQL statements:

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE "t_user" (
"uid" INTEGER,
  "name" TEXT
);
INSERT INTO "t_user" VALUES(9000,'Gerd Abrahamsson');
INSERT INTO "t_user" VALUES(9001,'Hanna Andersson');
INSERT INTO "t_user" VALUES(9002,'August Bergsten');
INSERT INTO "t_user" VALUES(9003,'Arvid Bohlin');
INSERT INTO "t_user" VALUES(9004,'Edvard Marklund');
INSERT INTO "t_user" VALUES(9005,'Ragnhild Brännström');
INSERT INTO "t_user" VALUES(9006,'Börje Wallin');
INSERT INTO "t_user" VALUES(9007,'Otto Byström');
INSERT INTO "t_user" VALUES(9008,'Elise Dahlström');
CREATE TABLE "t_transaction" (
"xid" INTEGER,
  "uid" INTEGER,
  "amount" INTEGER,
  "date" TIMESTAMP
);
INSERT INTO "t_transaction" VALUES(5000,9008,498,'2016-02-21 06:28:49');
INSERT INTO "t_transaction" VALUES(5001,9003,268,'2016-01-17 13:37:38');
INSERT INTO "t_transaction" VALUES(5002,9003,621,'2016-02-24 15:36:53');
INSERT INTO "t_transaction" VALUES(5003,9007,-401,'2016-01-14 16:43:27');
INSERT INTO "t_transaction" VALUES(5004,9004,720,'2016-05-14 16:29:54');
INSERT INTO "t_transaction" VALUES(5005,9007,-492,'2016-02-24 23:58:57');
INSERT INTO "t_transaction" VALUES(5006,9002,-153,'2016-02-18 17:58:33');
INSERT INTO "t_transaction" VALUES(5007,9008,272,'2016-05-26 12:00:00');
INSERT INTO "t_transaction" VALUES(5008,9005,-250,'2016-02-24 23:14:52');
INSERT INTO "t_transaction" VALUES(5009,9008,82,'2016-04-20 18:33:25');
INSERT INTO "t_transaction" VALUES(5010,9006,549,'2016-02-16 14:37:25');
INSERT INTO "t_transaction" VALUES(5011,9008,-571,'2016-02-28 13:05:33');
INSERT INTO "t_transaction" VALUES(5012,9008,814,'2016-03-20 13:29:11');
INSERT INTO "t_transaction" VALUES(5013,9005,-114,'2016-02-06 14:55:10');
INSERT INTO "t_transaction" VALUES(5014,9005,819,'2016-01-18 10:50:20');
INSERT INTO "t_transaction" VALUES(5015,9001,-404,'2016-02-20 22:08:23');
INSERT INTO "t_transaction" VALUES(5016,9000,-95,'2016-05-09 10:26:05');
INSERT INTO "t_transaction" VALUES(5017,9003,428,'2016-03-27 15:30:47');
INSERT INTO "t_transaction" VALUES(5018,9002,-549,'2016-04-15 21:44:49');
INSERT INTO "t_transaction" VALUES(5019,9001,-462,'2016-03-09 20:32:35');
INSERT INTO "t_transaction" VALUES(5020,9004,-339,'2016-05-03 17:11:21');
COMMIT;

Add these index creation statements for completeness:

CREATE INDEX "ix_t_user_uid" ON "t_user" ("uid");
CREATE INDEX "ix_t_transaction_xid" ON "t_transaction" ("xid");

This is a bit different then prior tab's: instead of creating from columns, we create it from a sequence of rows.

Run the following in the Spark shell:

val udf_df=sx.createDataFrame(Seq(  
    (9000,"Gerd Abrahamsson"),
    (9001,"Hanna Andersson"),
    (9002,"August Bergsten"),
    (9003,"Arvid Bohlin"),
    (9004,"Edvard Marklund"),
    (9005,"Ragnhild Brännström"),
    (9006,"Börje Wallin"),
    (9007,"Otto Byström"),
    (9008,"Elise Dahlström"))).toDF( "uid","name")



import java.sql.Timestamp

val date_format=new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val udfToDate=udf[Timestamp,String]{ (s) => 
        new java.sql.Timestamp(date_format.parse(s).getTime()) }



val tdf_df=sx.createDataFrame(Seq(  
    (5000,9008,498,"2016-02-21 06:28:49"),
    (5001,9003,268,"2016-01-17 13:37:38"),
    (5002,9003,621,"2016-02-24 15:36:53"),
    (5003,9007,-401,"2016-01-14 16:43:27"),
    (5004,9004,720,"2016-05-14 16:29:54"),
    (5005,9007,-492,"2016-02-24 23:58:57"),
    (5006,9002,-153,"2016-02-18 17:58:33"),
    (5007,9008,272,"2016-05-26 12:00:00"),
    (5008,9005,-250,"2016-02-24 23:14:52"),
    (5009,9008,82,"2016-04-20 18:33:25"),
    (5010,9006,549,"2016-02-16 14:37:25"),
    (5011,9008,-571,"2016-02-28 13:05:33"),
    (5012,9008,814,"2016-03-20 13:29:11"),
    (5013,9005,-114,"2016-02-06 14:55:10"),
    (5014,9005,819,"2016-01-18 10:50:20"),
    (5015,9001,-404,"2016-02-20 22:08:23"),
    (5016,9000,-95,"2016-05-09 10:26:05"),
    (5017,9003,428,"2016-03-27 15:30:47"),
    (5018,9002,-549,"2016-04-15 21:44:49"),
    (5019,9001,-462,"2016-03-09 20:32:35"),
    (5020,9004,-339,"2016-05-03 17:11:21"))
  ).toDF( "xid","uid","amount","date").
   .withColumn("date", udfToDate($"date"))

Et voila: the dataframes udf_df and tdf_df.

..

Sidenote: create fake data

The above user and transaction data was created using the following script, which employs the fake-factory package ( 'pip install fake-factory') to generated random data

The python code: 

from faker import Factory
import pandas as pd
import random

fake = Factory.create('sv_SE')

# create user data
uid_v=[]
name_v=[]
for i in range(0,9):
    uid_v.append(9000+i)
    name_v.append(fake.name())

# create transaction data
xid_v=[]
uid_v=[]
amount_v=[]
date_v=[]
sign=[-1,1]
for i in range(0,21):
    xid_v.append(5000+i)
    amount_v.append(sign[random.randint(0,1)]*random.randint(80,900))
    uid_v.append(id_v[random.randint(0,len(id_v)-1)])
    date_v.append(str(fake.date_time_this_year()))
 
Notes by Data Munging Ninja. Generated on nini:sync/20151223_datamungingninja/frameddata at 2016-10-18 07:18