[Rd] readLines() segfaults on large file & question on how to work around

Ista Zahn istazahn at gmail.com
Sat Sep 2 21:38:16 CEST 2017


As a work-around I suggest readr::read_file.
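
A minimal sketch of that work-around, assuming the readr and jsonlite packages are installed. The temporary file and its contents are hypothetical stand-ins for the real input. Note that R limits a single character string to 2^31 - 1 bytes (see ?"Memory-limits"), so this can only help when the whole file fits under that limit; I have not tested it on the 2.1GB file.

```r
# Sketch only: read a single-line file in one call, then parse it as JSON.
# Assumes readr and jsonlite are installed; the file below is a made-up
# small example, not the real data.
library(readr)
library(jsonlite)

path <- tempfile(fileext = ".json")
writeLines('{"a":[1,2,3],"b":"x"}', path, sep = "")  # one line, no newline

# A single R string cannot exceed 2^31 - 1 bytes, so check the size first.
stopifnot(file.size(path) < 2^31 - 1)

txt <- read_file(path)   # whole file as one character string
dat <- fromJSON(txt)     # parse the JSON text into R objects
str(dat)
```

For the real data, path would point at the large file instead of the temporary one.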

--Ista


On Sep 2, 2017 2:58 PM, "Jennifer Lyon" <jennifer.s.lyon at gmail.com> wrote:

> Hi:
>
> I have a 2.1GB JSON file. Typically I use readLines() and
> jsonlite::fromJSON() to extract data from a JSON file.
>
> When I try and read in this file using readLines() R segfaults.
>
> I believe the two salient issues with this file are
> 1). Its size
> 2). It is a single line (no line breaks)
>
> I can reproduce this issue as follows
> #Generate a big file with no line breaks
> # In R
> > writeLines(paste0(c(letters, 0:9), collapse=""), "alpha.txt", sep="")
>
> # in unix shell
> cp alpha.txt file.txt
> for i in {1..26}; do cat file.txt file.txt > file2.txt && mv -f file2.txt file.txt; done
>
> This generates a 2.3GB file with no line breaks
>
> in R:
> > moo <- readLines("file.txt")
>
>  *** caught segfault ***
> address 0x7cffffff, cause 'memory not mapped'
>
> Traceback:
>  1: readLines("file.txt")
>
> Possible actions:
> 1: abort (with core dump, if enabled)
> 2: normal R exit
> 3: exit R without saving workspace
> 4: exit R saving workspace
> Selection: 3
>
> I conclude:
>  I am potentially running up against a limit in R, which should give a
> reasonable error, but currently just segfaults.
>
> My question:
> Most of the content of the JSON is an approximately 100K x 6K JSON
> equivalent of a dataframe, and I know R can handle much bigger than this
> size. I am expecting these JSON files to get even larger. My R code lives
> in a bigger system, and the JSON comes in via stdin, so I have absolutely
> no control over the data format. I can imagine trying to incrementally
> parse the JSON so I don't bump up against the limit, but I am eager for
> suggestions of simpler solutions.
>
> Also, I apologize for the timing of this bug report, as I know folks are
> working to get out the next release of R, but like so many things I have no
> control over when bugs leap up.
>
> Thanks.
>
> Jen
>
> > sessionInfo()
> R version 3.4.1 (2017-06-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 14.04.5 LTS
>
> Matrix products: default
> BLAS: R-3.4.1/lib/libRblas.so
> LAPACK:R-3.4.1/lib/libRlapack.so
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.4.1
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
