|
Create dataframe from arrays
Heres' how to create dataframes udf and tdf (aka tables t_user and t_transaction) from simple arrays.
Create dataframe udf
Create a dataframe from arrays:
id_v=c(9000, 9001, 9002, 9003, 9004, 9005, 9006, 9007, 9008)
name_v=c('Gerd Abrahamsson', 'Hanna Andersson', 'August Bergsten',
'Arvid Bohlin', 'Edvard Marklund', 'Ragnhild Brännström',
'Börje Wallin', 'Otto Byström','Elise Dahlström')
udf=data.frame( uid=id_v, name=name_v)
Content of udf:
uid name
1 9000 Gerd Abrahamsson
2 9001 Hanna Andersson
3 9002 August Bergsten
4 9003 Arvid Bohlin
5 9004 Edvard Marklund
6 9005 Ragnhild Brännström
7 9006 Börje Wallin
8 9007 Otto Byström
9 9008 Elise Dahlström
Create dataframe tdf
xid_v <- c( 5000, 5001, 5002, 5003, 5004, 5005, 5006, 5007, 5008, 5009, 5010, 5011, 5012,
5013, 5014, 5015, 5016, 5017, 5018, 5019, 5020)
uid_v <- c( 9008, 9003, 9003, 9007, 9004, 9007, 9002, 9008, 9005, 9008, 9006, 9008, 9008,
9005, 9005, 9001, 9000, 9003, 9002, 9001, 9004)
amount_v <- c(498, 268, 621, -401, 720, -492, -153, 272, -250, 82, 549, -571, 814, -114,
819, -404, -95, 428, -549, -462, -339)
date_v <- c('2016-02-21 06:28:49', '2016-01-17 13:37:38', '2016-02-24 15:36:53',
'2016-01-14 16:43:27', '2016-05-14 16:29:54', '2016-02-24 23:58:57',
'2016-02-18 17:58:33', '2016-05-26 12:00:00', '2016-02-24 23:14:52',
'2016-04-20 18:33:25', '2016-02-16 14:37:25', '2016-02-28 13:05:33',
'2016-03-20 13:29:11', '2016-02-06 14:55:10', '2016-01-18 10:50:20',
'2016-02-20 22:08:23', '2016-05-09 10:26:05', '2016-03-27 15:30:47',
'2016-04-15 21:44:49', '2016-03-09 20:32:35', '2016-05-03 17:11:21')
tdf<-data.frame( xid=xid_v, uid=uid_v, amount=amount_v, date=as.POSIXct(date_v) )
Content of tdf:
xid uid amount date
1 5000 9008 498 2016-02-21 06:28:49
2 5001 9003 268 2016-01-17 13:37:38
3 5002 9003 621 2016-02-24 15:36:53
4 5003 9007 -401 2016-01-14 16:43:27
5 5004 9004 720 2016-05-14 16:29:54
6 5005 9007 -492 2016-02-24 23:58:57
7 5006 9002 -153 2016-02-18 17:58:33
8 5007 9008 272 2016-05-26 12:00:00
9 5008 9005 -250 2016-02-24 23:14:52
10 5009 9008 82 2016-04-20 18:33:25
11 5010 9006 549 2016-02-16 14:37:25
12 5011 9008 -571 2016-02-28 13:05:33
13 5012 9008 814 2016-03-20 13:29:11
14 5013 9005 -114 2016-02-06 14:55:10
15 5014 9005 819 2016-01-18 10:50:20
16 5015 9001 -404 2016-02-20 22:08:23
17 5016 9000 -95 2016-05-09 10:26:05
18 5017 9003 428 2016-03-27 15:30:47
19 5018 9002 -549 2016-04-15 21:44:49
20 5019 9001 -462 2016-03-09 20:32:35
21 5020 9004 -339 2016-05-03 17:11:21
Create dataframe udf
Create a dataframe from arrays:
import pandas as pd
uid_v= [9000, 9001, 9002, 9003, 9004, 9005, 9006, 9007, 9008]
name_v=[u'Gerd Abrahamsson', u'Hanna Andersson', u'August Bergsten',
u'Arvid Bohlin', u'Edvard Marklund', u'Ragnhild Br\xe4nnstr\xf6m',
u'B\xf6rje Wallin', u'Otto Bystr\xf6m',u'Elise Dahlstr\xf6m']
udf=pd.DataFrame(uid_v, columns=['uid'])
udf['name']=name_v
Content of udf:
uid name
0 9000 Gerd Abrahamsson
1 9001 Hanna Andersson
2 9002 August Bergsten
3 9003 Arvid Bohlin
4 9004 Edvard Marklund
5 9005 Ragnhild Brännström
6 9006 Börje Wallin
7 9007 Otto Byström
8 9008 Elise Dahlström
Create dataframe tdf
xid_v=[ 5000, 5001, 5002, 5003, 5004, 5005, 5006, 5007, 5008, 5009, 5010, 5011, 5012,
5013, 5014, 5015, 5016, 5017, 5018, 5019, 5020]
uid_v=[ 9008, 9003, 9003, 9007, 9004, 9007, 9002, 9008, 9005, 9008, 9006, 9008, 9008,
9005, 9005, 9001, 9000, 9003, 9002, 9001, 9004]
amount_v= [498, 268, 621, -401, 720, -492, -153, 272, -250, 82, 549, -571, 814, -114,
819, -404, -95, 428, -549, -462, -339]
date_v=['2016-02-21 06:28:49', '2016-01-17 13:37:38', '2016-02-24 15:36:53',
'2016-01-14 16:43:27', '2016-05-14 16:29:54', '2016-02-24 23:58:57',
'2016-02-18 17:58:33', '2016-05-26 12:00:00', '2016-02-24 23:14:52',
'2016-04-20 18:33:25', '2016-02-16 14:37:25', '2016-02-28 13:05:33',
'2016-03-20 13:29:11', '2016-02-06 14:55:10', '2016-01-18 10:50:20',
'2016-02-20 22:08:23', '2016-05-09 10:26:05', '2016-03-27 15:30:47',
'2016-04-15 21:44:49', '2016-03-09 20:32:35', '2016-05-03 17:11:21']
tdf=pd.DataFrame(xid_v, columns=['xid'])
tdf['uid']=uid_v
tdf['amount']=amount_v
tdf['date']=pd.to_datetime(date_v)
Content of tdf:
xid uid amount date
0 5000 9008 498 2016-02-21 06:28:49
1 5001 9003 268 2016-01-17 13:37:38
2 5002 9003 621 2016-02-24 15:36:53
3 5003 9007 -401 2016-01-14 16:43:27
4 5004 9004 720 2016-05-14 16:29:54
5 5005 9007 -492 2016-02-24 23:58:57
6 5006 9002 -153 2016-02-18 17:58:33
7 5007 9008 272 2016-05-26 12:00:00
8 5008 9005 -250 2016-02-24 23:14:52
9 5009 9008 82 2016-04-20 18:33:25
10 5010 9006 549 2016-02-16 14:37:25
11 5011 9008 -571 2016-02-28 13:05:33
12 5012 9008 814 2016-03-20 13:29:11
13 5013 9005 -114 2016-02-06 14:55:10
14 5014 9005 819 2016-01-18 10:50:20
15 5015 9001 -404 2016-02-20 22:08:23
16 5016 9000 -95 2016-05-09 10:26:05
17 5017 9003 428 2016-03-27 15:30:47
18 5018 9002 -549 2016-04-15 21:44:49
19 5019 9001 -462 2016-03-09 20:32:35
20 5020 9004 -339 2016-05-03 17:11:21
Example: select data from a certain date range
from datetime import datetime
startdate=datetime.strptime('2016-02-01','%Y-%m-%d')
enddate=datetime.strptime('2016-02-29','%Y-%m-%d')
tdf[((tdf.txdate >= startdate) & (tdf.txdate <=enddate))]
The easiest way to generate the sql statements is to use python pandas to connect and export the dataframes to a sqlite database, like this:
import sqlite3
con=sqlite3.connect('db.sqlite')
udf.to_sql(name='t_user', con=con, index=False)
tdf.to_sql(name='t_transaction', con=con, index=False)
con.close()
Then issue this command on the CLI :
sqlite3 db.sqlite .dump > create.sql
.. to give you these SQL statements:
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE "t_user" (
"uid" INTEGER,
"name" TEXT
);
INSERT INTO "t_user" VALUES(9000,'Gerd Abrahamsson');
INSERT INTO "t_user" VALUES(9001,'Hanna Andersson');
INSERT INTO "t_user" VALUES(9002,'August Bergsten');
INSERT INTO "t_user" VALUES(9003,'Arvid Bohlin');
INSERT INTO "t_user" VALUES(9004,'Edvard Marklund');
INSERT INTO "t_user" VALUES(9005,'Ragnhild Brännström');
INSERT INTO "t_user" VALUES(9006,'Börje Wallin');
INSERT INTO "t_user" VALUES(9007,'Otto Byström');
INSERT INTO "t_user" VALUES(9008,'Elise Dahlström');
CREATE TABLE "t_transaction" (
"xid" INTEGER,
"uid" INTEGER,
"amount" INTEGER,
"date" TIMESTAMP
);
INSERT INTO "t_transaction" VALUES(5000,9008,498,'2016-02-21 06:28:49');
INSERT INTO "t_transaction" VALUES(5001,9003,268,'2016-01-17 13:37:38');
INSERT INTO "t_transaction" VALUES(5002,9003,621,'2016-02-24 15:36:53');
INSERT INTO "t_transaction" VALUES(5003,9007,-401,'2016-01-14 16:43:27');
INSERT INTO "t_transaction" VALUES(5004,9004,720,'2016-05-14 16:29:54');
INSERT INTO "t_transaction" VALUES(5005,9007,-492,'2016-02-24 23:58:57');
INSERT INTO "t_transaction" VALUES(5006,9002,-153,'2016-02-18 17:58:33');
INSERT INTO "t_transaction" VALUES(5007,9008,272,'2016-05-26 12:00:00');
INSERT INTO "t_transaction" VALUES(5008,9005,-250,'2016-02-24 23:14:52');
INSERT INTO "t_transaction" VALUES(5009,9008,82,'2016-04-20 18:33:25');
INSERT INTO "t_transaction" VALUES(5010,9006,549,'2016-02-16 14:37:25');
INSERT INTO "t_transaction" VALUES(5011,9008,-571,'2016-02-28 13:05:33');
INSERT INTO "t_transaction" VALUES(5012,9008,814,'2016-03-20 13:29:11');
INSERT INTO "t_transaction" VALUES(5013,9005,-114,'2016-02-06 14:55:10');
INSERT INTO "t_transaction" VALUES(5014,9005,819,'2016-01-18 10:50:20');
INSERT INTO "t_transaction" VALUES(5015,9001,-404,'2016-02-20 22:08:23');
INSERT INTO "t_transaction" VALUES(5016,9000,-95,'2016-05-09 10:26:05');
INSERT INTO "t_transaction" VALUES(5017,9003,428,'2016-03-27 15:30:47');
INSERT INTO "t_transaction" VALUES(5018,9002,-549,'2016-04-15 21:44:49');
INSERT INTO "t_transaction" VALUES(5019,9001,-462,'2016-03-09 20:32:35');
INSERT INTO "t_transaction" VALUES(5020,9004,-339,'2016-05-03 17:11:21');
COMMIT;
Add these index creation statements for completeness:
CREATE INDEX "ix_t_user_uid" ON "t_user" ("uid");
CREATE INDEX "ix_t_transaction_xid" ON "t_transaction" ("xid");
This is a bit different then prior tab's: instead of creating from columns, we create it from a sequence of rows.
Run the following in the Spark shell:
val udf_df=sx.createDataFrame(Seq(
(9000,"Gerd Abrahamsson"),
(9001,"Hanna Andersson"),
(9002,"August Bergsten"),
(9003,"Arvid Bohlin"),
(9004,"Edvard Marklund"),
(9005,"Ragnhild Brännström"),
(9006,"Börje Wallin"),
(9007,"Otto Byström"),
(9008,"Elise Dahlström"))).toDF( "uid","name")
import java.sql.Timestamp
val date_format=new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val udfToDate=udf[Timestamp,String]{ (s) =>
new java.sql.Timestamp(date_format.parse(s).getTime()) }
val tdf_df=sx.createDataFrame(Seq(
(5000,9008,498,"2016-02-21 06:28:49"),
(5001,9003,268,"2016-01-17 13:37:38"),
(5002,9003,621,"2016-02-24 15:36:53"),
(5003,9007,-401,"2016-01-14 16:43:27"),
(5004,9004,720,"2016-05-14 16:29:54"),
(5005,9007,-492,"2016-02-24 23:58:57"),
(5006,9002,-153,"2016-02-18 17:58:33"),
(5007,9008,272,"2016-05-26 12:00:00"),
(5008,9005,-250,"2016-02-24 23:14:52"),
(5009,9008,82,"2016-04-20 18:33:25"),
(5010,9006,549,"2016-02-16 14:37:25"),
(5011,9008,-571,"2016-02-28 13:05:33"),
(5012,9008,814,"2016-03-20 13:29:11"),
(5013,9005,-114,"2016-02-06 14:55:10"),
(5014,9005,819,"2016-01-18 10:50:20"),
(5015,9001,-404,"2016-02-20 22:08:23"),
(5016,9000,-95,"2016-05-09 10:26:05"),
(5017,9003,428,"2016-03-27 15:30:47"),
(5018,9002,-549,"2016-04-15 21:44:49"),
(5019,9001,-462,"2016-03-09 20:32:35"),
(5020,9004,-339,"2016-05-03 17:11:21"))
).toDF( "xid","uid","amount","date").
.withColumn("date", udfToDate($"date"))
Et voila: the dataframes udf_df and tdf_df.
..
Sidenote: create fake data
The above user and transaction data was created using the following script, which employs the fake-factory package ( 'pip install fake-factory') to generated random data
The python code:
from faker import Factory
import pandas as pd
import random
fake = Factory.create('sv_SE')
# create user data
uid_v=[]
name_v=[]
for i in range(0,9):
uid_v.append(9000+i)
name_v.append(fake.name())
# create transaction data
xid_v=[]
uid_v=[]
amount_v=[]
date_v=[]
sign=[-1,1]
for i in range(0,21):
xid_v.append(5000+i)
amount_v.append(sign[random.randint(0,1)]*random.randint(80,900))
uid_v.append(id_v[random.randint(0,len(id_v)-1)])
date_v.append(str(fake.date_time_this_year()))
| |