TL;DR
Reading a large feather file in R produces this error:
Traceback:
1: .Call("feather_coldataFeather", PACKAGE = "feather", feather, indexes)
2: coldataFeather(x, i)
3: `[.feather`(x, )
4: x[]
5: as_data_frame.feather(data)
6: as_data_frame(data)
7: read_feather(filename)
The feather file was produced from a CSV file using R. It reads without problems in a Python session (using the feather-format package), but reading it back into a new R session fails.
Reading a smaller version of the same file causes no problem.
Environment
Server
Debian linux 8.4
R version:
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 3.0
year 2016
month 05
day 03
svn rev 70573
language R
version.string R version 3.3.0 (2016-05-03)
nickname Supposedly Educational
Python version:
import sys
sys.version
'2.7.9 (default, Mar 1 2015, 12:57:24) \n[GCC 4.9.2]'
Reproduce the error
Download the yellow cab trip sheet data for January 2015 from www.nyc.gov/html/tlc/html/about/trip_record_data.shtml (the file is about 1.9G).
Read the CSV file and write the feather file in R like this:
library(feather)
csv_file='yellow_tripdata_2015-01.csv'
feather_file='yellow_tripdata_2015-01.feather'
df<-read.table(csv_file,header=TRUE,sep=",",quote="",stringsAsFactors=F, na.strings = "")
write_feather(df, feather_file)
Reading the feather file back in the same R session works, so exit R and start a new session, then run:
library(feather)
feather_file='yellow_tripdata_2015-01.feather'
df <- read_feather(feather_file)
Output:
*** caught segfault ***
address 0x7f3c405ff010, cause 'memory not mapped'
Traceback:
1: .Call("feather_coldataFeather", PACKAGE = "feather", feather, indexes)
2: coldataFeather(x, i)
3: `[.feather`(x, )
4: x[]
5: as_data_frame.feather(data)
6: as_data_frame(data)
7: read_feather(feather_file)
To see at what size the error starts to occur, the same steps were run on smaller versions of the data file.
For each N in 1001, 10001, 25001, ... 10000001 lines of the CSV file (one header line plus the data rows):
- a file consisting of the first N lines of the original data file is produced
- this CSV file is read in R using read.table() and written out as a feather file
- the feather file is read into Python (to check that it works)
- the feather file is read into R (which starts failing as N grows)
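The first step above is done with head in the script further down. As a rough illustration only, the same step could look like this in Python 3. The helper names head_lines and data_rows are made up for this sketch and are not part of the original tooling. Since the first line of the CSV is the header, a file of N lines holds N - 1 data rows.

```python
import itertools


def head_lines(src_path, dst_path, n_lines):
    """Copy the first n_lines lines (header included) of src_path to dst_path."""
    # Binary mode copies the bytes untouched, so no encoding has to be assumed.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.writelines(itertools.islice(src, n_lines))


def data_rows(n_lines):
    """Number of data rows in a chunk of n_lines lines whose first line is the header."""
    return max(n_lines - 1, 0)


# Hypothetical usage (paths are placeholders):
#   head_lines("yellow_tripdata_2015-01.csv", "yellow_tripdata_2015-01-1001.csv", 1001)
#   data_rows(1001) is then the row count R should report for the chopped file.
```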
Output of the run
------------------------
1001 lines
Converting yellow_tripdata_2015-01-1001.csv -> yellow_tripdata_2015-01-1001.feather
Reading feather file in python <- yellow_tripdata_2015-01-1001.feather
Reading feather file in R <- yellow_tripdata_2015-01-1001.feather
------------------------
10001 lines
Converting yellow_tripdata_2015-01-10001.csv -> yellow_tripdata_2015-01-10001.feather
Reading feather file in python <- yellow_tripdata_2015-01-10001.feather
Reading feather file in R <- yellow_tripdata_2015-01-10001.feather
------------------------
25001 lines
Converting yellow_tripdata_2015-01-25001.csv -> yellow_tripdata_2015-01-25001.feather
Reading feather file in python <- yellow_tripdata_2015-01-25001.feather
Reading feather file in R <- yellow_tripdata_2015-01-25001.feather
------------------------
50001 lines
Converting yellow_tripdata_2015-01-50001.csv -> yellow_tripdata_2015-01-50001.feather
Reading feather file in python <- yellow_tripdata_2015-01-50001.feather
Reading feather file in R <- yellow_tripdata_2015-01-50001.feather
------------------------
75001 lines
Converting yellow_tripdata_2015-01-75001.csv -> yellow_tripdata_2015-01-75001.feather
Reading feather file in python <- yellow_tripdata_2015-01-75001.feather
Reading feather file in R <- yellow_tripdata_2015-01-75001.feather
So far so good, but at 100001 lines a first type of error occurs:
------------------------
100001 lines
Converting yellow_tripdata_2015-01-100001.csv -> yellow_tripdata_2015-01-100001.feather
Reading feather file in python <- yellow_tripdata_2015-01-100001.feather
Reading feather file in R <- yellow_tripdata_2015-01-100001.feather
Error in coldataFeather(x, i) :
SET_STRING_ELT() can only be applied to a 'character vector', not a 'NULL'
Calls: system.time ... as_data_frame.feather -> [ -> [.feather -> coldataFeather -> .Call
Execution halted
From 500001 lines onward, the 'memory not mapped' segfault shows up:
------------------------
500001 lines
Converting yellow_tripdata_2015-01-500001.csv -> yellow_tripdata_2015-01-500001.feather
Reading feather file in python <- yellow_tripdata_2015-01-500001.feather
Reading feather file in R <- yellow_tripdata_2015-01-500001.feather
*** caught segfault ***
address 0x7f5c16387010, cause 'memory not mapped'
Traceback:
1: .Call("feather_coldataFeather", PACKAGE = "feather", feather, indexes)
2: coldataFeather(x, i)
3: `[.feather`(x, )
4: x[]
5: as_data_frame.feather(data)
6: as_data_frame(data)
7: read_feather(filename)
8: system.time({ df <- read_feather(filename)})
An irrecoverable exception occurred. R is aborting now ...
./aardvark.sh: line 8: 7730 Segmentation fault $R_EXE -f ./process_feather.R --args $T2 >> $O
------------------------
1000001 lines
Converting yellow_tripdata_2015-01-1000001.csv -> yellow_tripdata_2015-01-1000001.feather
Reading feather file in python <- yellow_tripdata_2015-01-1000001.feather
Reading feather file in R <- yellow_tripdata_2015-01-1000001.feather
*** caught segfault ***
address 0x7ffa6b007010, cause 'memory not mapped'
Traceback:
1: .Call("feather_coldataFeather", PACKAGE = "feather", feather, indexes)
2: coldataFeather(x, i)
3: `[.feather`(x, )
4: x[]
5: as_data_frame.feather(data)
6: as_data_frame(data)
7: read_feather(filename)
8: system.time({ df <- read_feather(filename)})
An irrecoverable exception occurred. R is aborting now ...
./aardvark.sh: line 8: 7763 Segmentation fault $R_EXE -f ./process_feather.R --args $T2 >> $O
------------------------
5000001 lines
Converting yellow_tripdata_2015-01-5000001.csv -> yellow_tripdata_2015-01-5000001.feather
Reading feather file in python <- yellow_tripdata_2015-01-5000001.feather
Reading feather file in R <- yellow_tripdata_2015-01-5000001.feather
*** caught segfault ***
address 0x7f4225031010, cause 'memory not mapped'
Traceback:
1: .Call("feather_coldataFeather", PACKAGE = "feather", feather, indexes)
2: coldataFeather(x, i)
3: `[.feather`(x, )
4: x[]
5: as_data_frame.feather(data)
6: as_data_frame(data)
7: read_feather(filename)
8: system.time({ df <- read_feather(filename)})
An irrecoverable exception occurred. R is aborting now ...
./aardvark.sh: line 8: 7796 Segmentation fault $R_EXE -f ./process_feather.R --args $T2 >> $O
------------------------
10000001 lines
Converting yellow_tripdata_2015-01-10000001.csv -> yellow_tripdata_2015-01-10000001.feather
Reading feather file in python <- yellow_tripdata_2015-01-10000001.feather
Reading feather file in R <- yellow_tripdata_2015-01-10000001.feather
*** caught segfault ***
address 0x7f1b5b25e010, cause 'memory not mapped'
Traceback:
1: .Call("feather_coldataFeather", PACKAGE = "feather", feather, indexes)
2: coldataFeather(x, i)
3: `[.feather`(x, )
4: x[]
5: as_data_frame.feather(data)
6: as_data_frame(data)
7: read_feather(filename)
8: system.time({ df <- read_feather(filename)})
An irrecoverable exception occurred. R is aborting now ...
./aardvark.sh: line 8: 7829 Segmentation fault $R_EXE -f ./process_feather.R --args $T2 >> $O
AARDVARK: ERROR: exit status 139
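The exit status 139 in the last line is consistent with the segfault messages: bash reports 128 + n when a command is killed by signal n, and 139 - 128 = 11, which is SIGSEGV on Linux. A small Python 3 sketch of that decoding (the function name describe_exit_status is made up for this example):

```python
import signal


def describe_exit_status(status):
    """Decode a shell exit status; values above 128 mean 'killed by signal status - 128'."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = "unknown signal"
        return "killed by signal %d (%s)" % (signum, name)
    return "exited with status %d" % status


print(describe_exit_status(139))
```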
These are the file-sizes of the feather files:
161K yellow_tripdata_2015-01-1001.feather
1.6M yellow_tripdata_2015-01-10001.feather
3.9M yellow_tripdata_2015-01-25001.feather
7.8M yellow_tripdata_2015-01-50001.feather
12M yellow_tripdata_2015-01-75001.feather
* 16M yellow_tripdata_2015-01-100001.feather
! 78M yellow_tripdata_2015-01-500001.feather
! 156M yellow_tripdata_2015-01-1000001.feather
! 778M yellow_tripdata_2015-01-5000001.feather
! 1.6G yellow_tripdata_2015-01-10000001.feather
A different error occurs at 100001 lines (*); from 500001 lines onward (!) the reported 'memory not mapped' error occurs.
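As a quick sanity check on these numbers, the sizes can be divided by the number of data rows. The Python 3 sketch below assumes the sizes are rounded, ls -h style values with 1024-based units (an assumption, the listing does not say), and that each file has one header line. Under those assumptions every file comes out at roughly 160 to 175 bytes per row, so the feather files grow linearly with the row count and the first failure falls between the 12M file (75001 lines) and the 16M file (100001 lines).

```python
# Sizes copied from the listing above, keyed by the number of CSV lines.
SIZES = {
    1001: "161K",
    10001: "1.6M",
    25001: "3.9M",
    50001: "7.8M",
    75001: "12M",
    100001: "16M",
    500001: "78M",
    1000001: "156M",
    5000001: "778M",
    10000001: "1.6G",
}

# Assumption: 1024-based units, as printed by ls -h.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(text):
    """Convert a rounded size such as '1.6M' to an approximate byte count."""
    return float(text[:-1]) * UNITS[text[-1]]


def bytes_per_row(n_lines, size_text):
    """Approximate feather bytes per data row (n_lines includes the header line)."""
    return to_bytes(size_text) / (n_lines - 1)


for n_lines, size_text in sorted(SIZES.items()):
    print("%9d lines  %5s  ~%.0f bytes/row" % (n_lines, size_text, bytes_per_row(n_lines, size_text)))
```

Because the listed sizes are rounded, the per-row figures are only approximate.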
Here's the aardvark.code file. (What's aardvark? See: data.munging.ninja/aardvarkcode ).
Before running aardvark, adapt the '/u01/data/..' and '/u02/data/..' paths to the directory structure on your machine, and don't forget to fetch the data file.
##================================================================================
##== readme.md
# Source for data files
The New York City trip data: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
##================================================================================
##== convert.R
# Convert .csv file to .feather file
library(feather)
args<-commandArgs(T)
if (length(args)<2) {
    stop("ERROR: insufficient number of arguments")
}
infile=args[1]
outfile=args[2]
df<-read.table(infile,header=TRUE,sep=",",quote="",stringsAsFactors=F, na.strings = "")
write_feather(df, outfile)
##================================================================================
##== process_feather.R
# read .feather file and sum up the tips
library(feather)
args<-commandArgs(T)
if (length(args)<1) {
    stop("ERROR: no datafilename given")
}
filename=args[1]
basename=sub('^.*/','',filename)
dur1 <- system.time({ df <- read_feather(filename) })
dur2 <- system.time({ sum_tips<-sum(df$tip_amount) })
cat( sprintf("r_feather,%d,%f,%f,%f,%s\n" ,
nrow(df),dur1['user.self'], dur2['user.self'], sum_tips,basename))
##================================================================================
##== process_feather.py
# read .feather file and sum up the tips
import pandas as pd
import feather,time,sys,re
# cmd line args
if len(sys.argv)<2:
    print "ERROR: no datafilename given"
    sys.exit(2)
filename=sys.argv[1]
basename=re.sub('^.*/','',filename)
start=time.time() ; df=feather.read_dataframe(filename) ; dur1=time.time()-start
start=time.time() ; sum_tips=df.tip_amount.sum() ; dur2=time.time()-start
print "py_feather,%d,%f,%f,%f,%s" % ( len(df),dur1, dur2, sum_tips,basename)
##================================================================================
##== aardvark.sh
#!/bin/bash
R_EXE="/usr/bin/R --slave --vanilla --quiet"
PY_EXE="/usr/bin/python2"
S=/u01/data/20160421_nyc_taxi/yellow_tripdata_2015-01.csv
O=out.csv
for N in 1001 10001 25001 50001 75001 100001 500001 1000001 5000001 10000001
do
    echo "------------------------"
    echo "$N lines"
    B=${S##*/} # basename of source file
    T1=/u02/data/20160526_nyc_taxi_feather_size/${B%.csv}-$N.csv
    # step 1: put N lines of csv data into a file
    if [ ! -f $T1 ]
    then
        echo "Chopping $B to $N lines -> "${T1##*/}
        head -$N $S > $T1
    fi
    # step 2: convert to feather
    T2=/u02/data/20160526_nyc_taxi_feather_size/${B%.csv}-$N.feather
    echo "Converting "${T1##*/}" -> "${T2##*/}
    $R_EXE -f ./convert.R --args "$T1" "$T2"
    # step 3: read the feather file in python
    echo "Reading feather file in python <- "${T2##*/}
    $PY_EXE ./process_feather.py $T2 >> $O
    # step 4: read the feather file in R
    echo "Reading feather file in R <- "${T2##*/}
    $R_EXE -f ./process_feather.R --args $T2 >> $O
done