[Rd] readLines() segfaults on large file & question on how to work around

Jennifer Lyon jennifer.s.lyon at gmail.com
Sat Sep 2 20:58:15 CEST 2017


Hi:

I have a 2.1GB JSON file. Typically I use readLines() and
jsonlite::fromJSON() to extract data from a JSON file.

When I try to read this file with readLines(), R segfaults.

I believe the two salient issues with this file are
1). Its size
2). It is a single line (no line breaks)

I can reproduce this issue as follows
#Generate a big file with no line breaks
# In R
> writeLines(paste0(c(letters, 0:9), collapse=""), "alpha.txt", sep="")

# in unix shell
cp alpha.txt file.txt
for i in {1..26}; do cat file.txt file.txt > file2.txt && mv -f file2.txt file.txt; done

This generates a 2.3GB file with no line breaks
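
As a quick check of the size arithmetic (a sketch only, using nothing beyond base R): the seed file holds 36 characters, and each of the 26 loop iterations doubles it, so the single line ends up larger than the biggest value a 32-bit signed integer can hold.

```r
# Size of the generated file, computed rather than measured.
seed <- paste0(c(letters, 0:9), collapse = "")   # the 36-character seed string
seed_bytes <- nchar(seed, type = "bytes")        # 36 bytes, no newline (sep = "")
size_bytes <- seed_bytes * 2^26                  # doubled 26 times by the shell loop

size_bytes                                       # total bytes in file.txt
size_bytes / 2^30                                # the same size in GiB
size_bytes > .Machine$integer.max                # does it exceed 2^31 - 1?
```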

in R:
> moo <- readLines("file.txt")

 *** caught segfault ***
address 0x7cffffff, cause 'memory not mapped'

Traceback:
 1: readLines("file.txt")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 3

I conclude:
 I am potentially running up against a limit in R, which should give a
reasonable error, but currently just segfaults.
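
For illustration, a guard along the following lines would turn the crash into an ordinary R error on the caller's side. It is only a sketch: safe_read_lines() is a hypothetical helper, not part of R, and it assumes the relevant limit is the documented cap on a single character string (2^31 - 1 bytes, see ?"Memory-limits"). The check is deliberately conservative, since a large file made of many short lines would also be rejected.

```r
# Hypothetical helper: refuse to call readLines() on a file that might not
# fit into a single R string (assumed limit: 2^31 - 1 bytes per string).
safe_read_lines <- function(path, limit = .Machine$integer.max) {
  size <- file.size(path)
  if (is.na(size)) {
    stop("cannot determine the size of ", path)
  }
  if (size > limit) {
    stop(sprintf("%s is %.0f bytes; a single line could exceed the %.0f-byte string limit",
                 path, size, limit))
  }
  readLines(path, warn = FALSE)
}
```

Called on a file such as the one generated above, this would stop with an error message instead of reaching readLines().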

My question:
Most of the content of the JSON is an approximately 100K x 6K JSON
equivalent of a dataframe, and I know R can handle much bigger than this
size. I am expecting these JSON files to get even larger. My R code lives
in a bigger system, and the JSON comes in via stdin, so I have absolutely
no control over the data format. I can imagine trying to incrementally
parse the JSON so I don't bump up against the limit, but I am eager for
suggestions of simpler solutions.
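
To make the incremental idea concrete, here is a rough sketch of reading a connection in fixed-size raw chunks, so that no single R string has to hold the whole input. read_in_chunks() and its on_chunk callback are hypothetical names introduced for this example; the sketch does not parse JSON, and because chunks are cut at arbitrary byte boundaries, whatever consumes them would have to cope with tokens split across chunks.

```r
# Hypothetical helper: read a binary connection chunk by chunk and pass each
# raw chunk to a callback. Returns the total number of bytes read.
read_in_chunks <- function(con, on_chunk, chunk_bytes = 64 * 1024^2) {
  total <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0L) break   # nothing left to read
    on_chunk(chunk)
    total <- total + length(chunk)
  }
  total
}

# Small demonstration on a temporary file with a tiny chunk size.
tmp <- tempfile()
writeBin(charToRaw("{\"a\": [1, 2, 3]}"), tmp)
con <- file(tmp, "rb")
n_read <- read_in_chunks(con, function(chunk) cat(length(chunk), "bytes\n"),
                         chunk_bytes = 8)
close(con)
```

For input arriving on stdin, the connection would presumably be opened with file("stdin", "rb") instead of a path.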

Also, I apologize for the timing of this bug report, as I know folks are
working to get out the next release of R, but, like so many things, I have
no control over when bugs leap up.

Thanks.

Jen

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: R-3.4.1/lib/libRblas.so
LAPACK:R-3.4.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.1
