[Rd] readLines() segfaults on large file & question on how to work around

Jennifer Lyon jennifer.s.lyon at gmail.com
Sat Sep 2 23:15:35 CEST 2017


Thank you for your suggestion. Unfortunately, while R doesn't segfault
calling readr::read_file() on the test file I described, I get the error
message:

Error in read_file_(ds, locale) : negative length vectors are not allowed
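
A likely explanation, offered as an assumption rather than a diagnosis: a single R character string is limited to 2^31 - 1 bytes, and the test file (36 bytes doubled 26 times) is larger than that. The sketch below checks a file against that limit and reads it as raw bytes in fixed-size chunks instead of as one string. The helper names exceeds_string_limit() and read_in_chunks() are hypothetical, not part of base R or readr, and parsing the JSON from the chunks is a separate problem this does not solve.

```r
# Hypothetical helper: TRUE if the file is too large to fit in one R string
# (a single string is limited to 2^31 - 1 bytes, i.e. .Machine$integer.max).
exceeds_string_limit <- function(path) {
  file.size(path) > .Machine$integer.max
}

# Hypothetical helper: read a file as raw bytes in fixed-size chunks,
# pass each chunk to `handle`, and return the total number of bytes read.
read_in_chunks <- function(path, handle, chunk_bytes = 64 * 1024^2) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  total <- 0  # double, so the running count cannot overflow an integer
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0L) break  # end of file
    handle(chunk)
    total <- total + length(chunk)
  }
  total
}

# Size of the test file from the reproduction: 36 bytes doubled 26 times.
test_file_bytes <- 36 * 2^26
exceeds_limit <- test_file_bytes > .Machine$integer.max

# Small demonstration on a 36-byte temporary file (not the 2.3GB one).
tmp <- tempfile()
writeLines(paste0(c(letters, 0:9), collapse = ""), tmp, sep = "")
n <- read_in_chunks(tmp, handle = function(chunk) invisible(NULL),
                    chunk_bytes = 10)
```

On the real file, `handle` would be whatever consumes the bytes (for example, appending to a buffer that an incremental parser reads from); the chunk size here is arbitrary.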

Jen

On Sat, Sep 2, 2017 at 1:38 PM, Ista Zahn <istazahn at gmail.com> wrote:

> As a workaround, I suggest readr::read_file().
>
> --Ista
>
>
> On Sep 2, 2017 2:58 PM, "Jennifer Lyon" <jennifer.s.lyon at gmail.com> wrote:
>
>> Hi:
>>
>> I have a 2.1GB JSON file. Typically I use readLines() and
>> jsonlite::fromJSON() to extract data from a JSON file.
>>
>> When I try to read in this file using readLines(), R segfaults.
>>
>> I believe the two salient issues with this file are
>> 1). Its size
>> 2). It is a single line (no line breaks)
>>
>> I can reproduce this issue as follows
>> #Generate a big file with no line breaks
>> # In R
>> > writeLines(paste0(c(letters, 0:9), collapse=""), "alpha.txt", sep="")
>>
>> # in unix shell
>> cp alpha.txt file.txt
>> for i in {1..26}; do cat file.txt file.txt > file2.txt && mv -f file2.txt file.txt; done
>>
>> This generates a 2.3GB file with no line breaks
>>
>> in R:
>> > moo <- readLines("file.txt")
>>
>>  *** caught segfault ***
>> address 0x7cffffff, cause 'memory not mapped'
>>
>> Traceback:
>>  1: readLines("file.txt")
>>
>> Possible actions:
>> 1: abort (with core dump, if enabled)
>> 2: normal R exit
>> 3: exit R without saving workspace
>> 4: exit R saving workspace
>> Selection: 3
>>
>> I conclude:
>>  I am potentially running up against a limit in R, which should give a
>> reasonable error, but currently just segfaults.
>>
>> My question:
>> Most of the content of the JSON is an approximately 100K x 6K JSON
>> equivalent of a dataframe, and I know R can handle much bigger than this
>> size. I am expecting these JSON files to get even larger. My R code lives
>> in a bigger system, and the JSON comes in via stdin, so I have absolutely
>> no control over the data format. I can imagine trying to incrementally
>> parse the JSON so I don't bump up against the limit, but I am eager for
>> suggestions of simpler solutions.
>>
>> Also, I apologize for the timing of this bug report, as I know folks are
>> working to get out the next release of R, but like so many things I have
>> no control over when bugs leap up.
>>
>> Thanks.
>>
>> Jen
>>
>> > sessionInfo()
>> R version 3.4.1 (2017-06-30)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> Running under: Ubuntu 14.04.5 LTS
>>
>> Matrix products: default
>> BLAS: R-3.4.1/lib/libRblas.so
>> LAPACK: R-3.4.1/lib/libRlapack.so
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> loaded via a namespace (and not attached):
>> [1] compiler_3.4.1
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
