TL;DR

Reading a large feather file in R crashes the session with a segfault ('memory not mapped'):

Traceback:
 1: .Call("feather_coldataFeather", PACKAGE = "feather", feather,     indexes)
 2: coldataFeather(x, i)
 3: `[.feather`(x, )
 4: x[]
 5: as_data_frame.feather(data)
 6: as_data_frame(data)
 7: read_feather(filename)

The feather file was written by R from a CSV file. Python (using the feather-format package) reads it without problems, but reading it back into a new R session fails.

Smaller versions of the same file (up to 75001 lines) are read without problems.

Environment

Server

Debian Linux 8.4

R version:

platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          3                           
minor          3.0                         
year           2016                        
month          05                          
day            03                          
svn rev        70573                       
language       R                           
version.string R version 3.3.0 (2016-05-03)
nickname       Supposedly Educational

Python version:

import sys
sys.version

'2.7.9 (default, Mar  1 2015, 12:57:24) \n[GCC 4.9.2]'

02_reproduce

20160525

Reproduce the error

Download the yellow cab trip sheet data for January 2015 from www.nyc.gov/html/tlc/html/about/trip_record_data.shtml, which is about 1.9G in size.

Read the CSV file and write it out as a feather file in R:

library(feather)
csv_file='yellow_tripdata_2015-01.csv'
feather_file='yellow_tripdata_2015-01.feather'
df<-read.table(csv_file,header=TRUE,sep=",",quote="",stringsAsFactors=FALSE, na.strings = "")
write_feather(df, feather_file)

Reading the feather file back in the same R session works, so exit that session and start a new one, in which you execute:

library(feather)
feather_file='yellow_tripdata_2015-01.feather'
df <- read_feather(feather_file)

Output:

 *** caught segfault ***
address 0x7f3c405ff010, cause 'memory not mapped'

Traceback:
 1: .Call("feather_coldataFeather", PACKAGE = "feather", feather,     indexes)
 2: coldataFeather(x, i)
 3: `[.feather`(x, )
 4: x[]
 5: as_data_frame.feather(data)
 6: as_data_frame(data)
 7: read_feather(feather_file)

03_detail

20160525

Here's the output for smaller versions of the same data file, to see at what size the error starts to occur.

This is what's done for each N in 1001, 10001, 25001, ... lines of the csv file:

a file consisting of the first N lines of the original data file is produced
this csv file is read in R using read.table(), and written out as a feather file
the feather file is read into python (to see if it works)
the feather file is read into R (which starts failing as N grows)
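
The first step amounts to the following, sketched here in Python 3 instead of the `head` call used in the script in the code section. The function name `write_first_lines` is made up for illustration. Note that N counts the header line as well, so N = 1001 yields 1000 data rows.

```python
from itertools import islice


def write_first_lines(src_path, dst_path, n):
    """Copy the first n lines of src_path to dst_path and return the number of lines written.

    Files are opened in binary mode so the bytes are copied unchanged.
    """
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # islice stops after n lines, so the rest of a large file is never read
        for line in islice(src, n):
            dst.write(line)
            written += 1
    return written
```

If the source file has fewer than n lines, the whole file is copied and the return value shows how many lines were actually available.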

Output of the run

------------------------
1001 lines
Converting yellow_tripdata_2015-01-1001.csv -> yellow_tripdata_2015-01-1001.feather
Reading feather file in python <- yellow_tripdata_2015-01-1001.feather
Reading feather file in R <- yellow_tripdata_2015-01-1001.feather
------------------------
10001 lines
Converting yellow_tripdata_2015-01-10001.csv -> yellow_tripdata_2015-01-10001.feather
Reading feather file in python <- yellow_tripdata_2015-01-10001.feather
Reading feather file in R <- yellow_tripdata_2015-01-10001.feather
------------------------
25001 lines
Converting yellow_tripdata_2015-01-25001.csv -> yellow_tripdata_2015-01-25001.feather
Reading feather file in python <- yellow_tripdata_2015-01-25001.feather
Reading feather file in R <- yellow_tripdata_2015-01-25001.feather
------------------------
50001 lines
Converting yellow_tripdata_2015-01-50001.csv -> yellow_tripdata_2015-01-50001.feather
Reading feather file in python <- yellow_tripdata_2015-01-50001.feather
Reading feather file in R <- yellow_tripdata_2015-01-50001.feather
------------------------
75001 lines
Converting yellow_tripdata_2015-01-75001.csv -> yellow_tripdata_2015-01-75001.feather
Reading feather file in python <- yellow_tripdata_2015-01-75001.feather
Reading feather file in R <- yellow_tripdata_2015-01-75001.feather

So far so good, but at 100001 lines a first type of error occurs:

------------------------
100001 lines
Converting yellow_tripdata_2015-01-100001.csv -> yellow_tripdata_2015-01-100001.feather
Reading feather file in python <- yellow_tripdata_2015-01-100001.feather
Reading feather file in R <- yellow_tripdata_2015-01-100001.feather
Error in coldataFeather(x, i) : 
  SET_STRING_ELT() can only be applied to a 'character vector', not a 'NULL'
Calls: system.time ... as_data_frame.feather -> [ -> [.feather -> coldataFeather -> .Call
Execution halted

From 500001 lines onward, the 'memory not mapped' segfault shows up:

------------------------
500001 lines
Converting yellow_tripdata_2015-01-500001.csv -> yellow_tripdata_2015-01-500001.feather
Reading feather file in python <- yellow_tripdata_2015-01-500001.feather
Reading feather file in R <- yellow_tripdata_2015-01-500001.feather

 *** caught segfault ***
address 0x7f5c16387010, cause 'memory not mapped'

Traceback:
 1: .Call("feather_coldataFeather", PACKAGE = "feather", feather,     indexes)
 2: coldataFeather(x, i)
 3: `[.feather`(x, )
 4: x[]
 5: as_data_frame.feather(data)
 6: as_data_frame(data)
 7: read_feather(filename)
 8: system.time({    df <- read_feather(filename)})
An irrecoverable exception occurred. R is aborting now ...
./aardvark.sh: line 8:  7730 Segmentation fault      $R_EXE -f ./process_feather.R --args $T2 >> $O


------------------------
1000001 lines
Converting yellow_tripdata_2015-01-1000001.csv -> yellow_tripdata_2015-01-1000001.feather
Reading feather file in python <- yellow_tripdata_2015-01-1000001.feather
Reading feather file in R <- yellow_tripdata_2015-01-1000001.feather

 *** caught segfault ***
address 0x7ffa6b007010, cause 'memory not mapped'

Traceback:
 1: .Call("feather_coldataFeather", PACKAGE = "feather", feather,     indexes)
 2: coldataFeather(x, i)
 3: `[.feather`(x, )
 4: x[]
 5: as_data_frame.feather(data)
 6: as_data_frame(data)
 7: read_feather(filename)
 8: system.time({    df <- read_feather(filename)})
An irrecoverable exception occurred. R is aborting now ...
./aardvark.sh: line 8:  7763 Segmentation fault      $R_EXE -f ./process_feather.R --args $T2 >> $O
------------------------
5000001 lines
Converting yellow_tripdata_2015-01-5000001.csv -> yellow_tripdata_2015-01-5000001.feather
Reading feather file in python <- yellow_tripdata_2015-01-5000001.feather
Reading feather file in R <- yellow_tripdata_2015-01-5000001.feather

 *** caught segfault ***
address 0x7f4225031010, cause 'memory not mapped'

Traceback:
 1: .Call("feather_coldataFeather", PACKAGE = "feather", feather,     indexes)
 2: coldataFeather(x, i)
 3: `[.feather`(x, )
 4: x[]
 5: as_data_frame.feather(data)
 6: as_data_frame(data)
 7: read_feather(filename)
 8: system.time({    df <- read_feather(filename)})
An irrecoverable exception occurred. R is aborting now ...
./aardvark.sh: line 8:  7796 Segmentation fault      $R_EXE -f ./process_feather.R --args $T2 >> $O
------------------------
10000001 lines
Converting yellow_tripdata_2015-01-10000001.csv -> yellow_tripdata_2015-01-10000001.feather
Reading feather file in python <- yellow_tripdata_2015-01-10000001.feather
Reading feather file in R <- yellow_tripdata_2015-01-10000001.feather

 *** caught segfault ***
address 0x7f1b5b25e010, cause 'memory not mapped'

Traceback:
 1: .Call("feather_coldataFeather", PACKAGE = "feather", feather,     indexes)
 2: coldataFeather(x, i)
 3: `[.feather`(x, )
 4: x[]
 5: as_data_frame.feather(data)
 6: as_data_frame(data)
 7: read_feather(filename)
 8: system.time({    df <- read_feather(filename)})
An irrecoverable exception occurred. R is aborting now ...
./aardvark.sh: line 8:  7829 Segmentation fault      $R_EXE -f ./process_feather.R --args $T2 >> $O
AARDVARK: ERROR: exit status 139
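
The exit status 139 in the last line is consistent with the segfault: shells such as bash report a process killed by signal n as 128 + n, and SIGSEGV is signal 11 on Linux. A minimal Python 3 sketch to decode such a status follows; the helper name `signal_from_exit_status` is hypothetical and not part of the run above.

```python
import signal


def signal_from_exit_status(status):
    """Return the signal name encoded in a shell exit status, or None.

    Assumes the shell convention that 128 + n means 'killed by signal n'.
    """
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        # the number above 128 does not correspond to a known signal
        return None


print(signal_from_exit_status(139))
```

On a Linux machine this should print SIGSEGV, matching the "Segmentation fault" lines in the log.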

These are the file-sizes of the feather files:

   161K  yellow_tripdata_2015-01-1001.feather
   1.6M  yellow_tripdata_2015-01-10001.feather
   3.9M  yellow_tripdata_2015-01-25001.feather
   7.8M  yellow_tripdata_2015-01-50001.feather
    12M  yellow_tripdata_2015-01-75001.feather
 *  16M  yellow_tripdata_2015-01-100001.feather
 !  78M  yellow_tripdata_2015-01-500001.feather
 ! 156M  yellow_tripdata_2015-01-1000001.feather
 ! 778M  yellow_tripdata_2015-01-5000001.feather
 ! 1.6G  yellow_tripdata_2015-01-10000001.feather

In the listing above, * marks the file (100001 lines) where the SET_STRING_ELT error occurs, and ! marks the files (500001 lines and up) where the 'memory not mapped' segfault occurs.
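
The first failure therefore appears to lie somewhere between 75001 and 100001 lines. One way to narrow that down, which was not done in the run above, would be a bisection over N. The Python 3 sketch below assumes that failures are monotonic in N (every size above the first failing one also fails), which the run suggests but does not prove. The callback `fails` is hypothetical: it would have to chop the csv file to n lines, convert it, read it back in a fresh R session, and return True if R exits with an error.

```python
def first_failing(lo, hi, fails):
    """Return the smallest n in (lo, hi] for which fails(n) is True.

    Assumes fails(lo) is False, fails(hi) is True, and that fails is
    monotonic: once it is True for some n, it stays True for larger n.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid   # the first failure is at mid or below
        else:
            lo = mid   # the first failure is above mid
    return hi


# Usage with a stand-in predicate; a real one would run the R read for n lines.
print(first_failing(75001, 100001, lambda n: n >= 90000))
```

With the bounds from the run, this needs about 15 conversions and reads instead of one per candidate size.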

04_code

20160525

Here's the aardvark.code file. (What's aardvark? See: data.munging.ninja/aardvarkcode ).

Before running aardvark, you need to adapt the '/u01/data/..' and '/u02/data/..' paths to the structure on your machine. And of course don't forget to fetch the data-file.

##================================================================================
##== readme.md 

# Source for data files

The New York City trip data: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml


##================================================================================
##== convert.R
# Convert .csv file to .feather file

library(feather) 

# only the arguments after --args: input csv file, output feather file
args <- commandArgs(trailingOnly = TRUE)
if (length(args)<2) {
    stop("ERROR: insufficient number of arguments") 
}
infile <- args[1]
outfile <- args[2]
df <- read.table(infile, header=TRUE, sep=",", quote="", stringsAsFactors=FALSE, na.strings="")
write_feather(df, outfile)

##================================================================================
##== process_feather.R
# read .feather file and sum up the tips

library(feather) 

args<-commandArgs(T)
if (length(args)<1) {
    stop("ERROR: no datafilename given") 
}

filename=args[1]
basename=sub('^.*/','',filename)
dur1 <- system.time({ df <- read_feather(filename) })
dur2 <- system.time({ sum_tips<-sum(df$tip_amount) })
cat( sprintf("r_feather,%d,%f,%f,%f,%s\n" , 
        nrow(df),dur1['user.self'], dur2['user.self'], sum_tips,basename))

##================================================================================
##== process_feather.py 
# read .feather file and sum up the tips

import pandas as pd
import feather,time,sys,re

# cmd line args
if len(sys.argv)<2:
  print "ERROR: no datafilename given"
  sys.exit(2)

filename=sys.argv[1]
basename=re.sub('^.*/','',filename)
start=time.time() ; df=feather.read_dataframe(filename) ; dur1=time.time()-start
start=time.time() ; sum_tips=df.tip_amount.sum() ; dur2=time.time()-start

print "py_feather,%d,%f,%f,%f,%s" % ( len(df),dur1, dur2, sum_tips,basename)

##================================================================================
##== aardvark.sh  
#!/bin/bash
R_EXE="/usr/bin/R --slave --vanilla --quiet"
PY_EXE="/usr/bin/python2"

S=/u01/data/20160421_nyc_taxi/yellow_tripdata_2015-01.csv
O=out.csv

for N in  1001 10001 25001 50001 75001 100001 500001 1000001 5000001 10000001 
do
    echo "------------------------"
    echo "$N lines" 

    B=${S##*/}          # basename of source file
    T1=/u02/data/20160526_nyc_taxi_feather_size/${B%.csv}-$N.csv

    # step 1: put N lines of csv data into a file
    if [ ! -f "$T1" ]
    then
        echo "Chopping $B to $N lines -> "${T1##*/}
        head -n "$N" "$S" > "$T1"
    fi
    
    # step 2: convert to feather
    T2=/u02/data/20160526_nyc_taxi_feather_size/${B%.csv}-$N.feather 
    echo "Converting "${T1##*/}" -> "${T2##*/}
    $R_EXE -f ./convert.R --args "$T1" "$T2"
    
    # step 3: read the feather file in python
    echo "Reading feather file in python <- "${T2##*/}
    $PY_EXE ./process_feather.py $T2 >> $O

    # step 4: read the feather file in R
    echo "Reading feather file in R <- "${T2##*/}
    $R_EXE -f ./process_feather.R --args $T2 >> $O
done

Notes by Data Munging Ninja. Generated on nini:/home/willem/sync/20151223_datamungingninja/pubcoms/20160526_bug_feather_mem_not_mapped at 2016-06-01 11:31