[R] readLines without skipNul=TRUE causes crash
Jeff Newmiller
jdnewmil at dcn.davis.ca.us
Sat Jul 15 22:14:14 CEST 2017
I am not able to reproduce this on a Linux platform:
########################
fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.1
tools::md5sum( fn1 )
## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
## "83e61c96092285b60d7bf6b0dbc7072e"
dat <- readLines( fn1 )
length( dat )
## [1] 4148721
No segfault occurs.
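
Since this file reads as 4148721 lines here while the file described in the quoted message has 80,937 lines, the two extractions may not be byte-identical. Besides comparing file.size() and tools::md5sum() on both machines, one could count the NUL bytes in each copy without loading any line into a string. The following is only a sketch: count_nul_bytes() is a hypothetical helper name, the chunk size is arbitrary, and it has been thought through on a tiny file only, not on the 4GB one.

```r
# Hypothetical helper: count 0x00 bytes by streaming the file in raw chunks,
# so no single line ever has to fit in a character string.
count_nul_bytes <- function(path, chunk_size = 1e6) {
  con <- file(path, "rb")
  on.exit(close(con))
  total <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0) break          # end of file
    total <- total + sum(chunk == as.raw(0))
  }
  total
}

# Tiny demonstration file: "a<NUL>b\n<NUL>c\n"
tf <- tempfile()
writeBin(as.raw(c(0x61, 0x00, 0x62, 0x0a, 0x00, 0x63, 0x0a)), tf)
n_nul <- count_nul_bytes(tf)
n_bytes <- file.size(tf)
```

A count of zero on one machine and a large count on the other would suggest the extraction step, rather than readLines itself, is where the two setups diverge.
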
On Sat, 15 Jul 2017, Anthony Damico wrote:
> hi, i realized that the segfault happens on the text file in a new R
> session. so, creating the segfault-generating text file requires a
> contributed package, but prompting the actual segfault does not -- pretty
> sure that means this is a base R bug? submitted here:
> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 hopefully i am
> not doing something remarkably stupid. the text file itself is 4GB so
> cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
> previous message, i think most or all of it needs to be there to trigger
> the segfault. thanks!
>
>
> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdamico at gmail.com> wrote:
>
>> hi, thanks Dr. Murdoch
>>
>>
>> i'd appreciate if anyone on r-help could help me narrow this down? i
>> believe the segfault occurs because there's a single line with 4GB and also
>> embedded nuls, but i am not sure how to artificially construct that?
>>
>>
>> the lodown package can be removed from my example.. it is just for file
>> download caching, so `lodown::cachaca` can be replaced with
>> `download.file`. my current example requires a huge download, so sort of
>> painful to repeat but i'm pretty confident that's not the issue.
>>
>>
>> the archive::archive_extract() function unzips a (probably corrupt) .RAR
>> file and creates a text file with 80,937 lines. this file is 4GB:
>>
>> > file.size(infile)
>> [1] 4078192743
>>
>>
>> i am pretty sure that nearly all of that 4GB is contained on a single line
>> in the file. here's what happens when i create a file connection and scan
>> through..
>>
>> > w <- file( infile , 'r' )
>> >
>> > first_80936_lines <- readLines( w , n = 80936 )
>> > scan( w , n = 1 , what = character() )
>> Read 1 item
>> [1] "1000023930632009"
>> > scan( w , n = 1 , what = character() )
>> Read 1 item
>> [1] "36F2924009PAULO"
>> > scan( w , n = 1 , what = character() )
>> Read 1 item
>> [1] "AFONSO"
>> > scan( w , n = 1 , what = character() )
>> Read 1 item
>> [1] "BA11"
>> > scan( w , n = 1 , what = character() )
>> Read 1 item
>> [1] "00000"
>> > scan( w , n = 1 , what = character() )
>> Read 1 item
>> [1] "00"
>> > scan( w , n = 1 , what = character() )
>> Read 1 item
>> [1] "2924009PAULO"
>> > scan( w , n = 1 , what = character() )
>> Read 1 item
>> [1] "AFONSO"
>> > scan( w , n = 1 , what = character() )
>> Read 1 item
>> [1] "BA1111"
>> > scan( w , n = 1 , what = character() )
>> Read 1 item
>> [1] "467.20"
>> > scan( w , n = 1 , what = character() )
>> Read 1 item
>> [1] "346.10"
>> > scan( w , n = 1 , what = character() )
>> Read 1 item
>> [1] "414.40"
>> > scan( w , n = 1 , what = character() )
>> Error in scan(w, n = 1, what = character()) :
>> could not allocate memory (2048 Mb) in C function
>> 'R_AllocStringBuffer'
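
The suspicion that nearly all of the file sits on one line could be checked without asking R to allocate that line as a string. One possible approach is to stream the file as raw bytes and record how many bytes fall between newline characters. This is a rough sketch under assumptions: line_lengths() is a hypothetical helper, it treats only 0x0a as a line terminator (a preceding CR is counted as part of the line), and it grows its result vector naively, which is acceptable for tens of thousands of lines but not tuned for millions.

```r
# Hypothetical helper: byte length of each line, computed from raw chunks.
line_lengths <- function(path, chunk_size = 1e6) {
  con <- file(path, "rb")
  on.exit(close(con))
  lengths <- numeric(0)
  current <- 0                              # bytes seen since the last newline
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0) break           # end of file
    nl <- which(chunk == as.raw(0x0a))      # newline positions in this chunk
    if (length(nl) == 0) {
      current <- current + length(chunk)
    } else {
      starts <- c(0, nl[-length(nl)])
      seg <- nl - starts - 1                # bytes before each newline
      seg[1] <- seg[1] + current            # add carry-over from earlier chunks
      lengths <- c(lengths, seg)
      current <- length(chunk) - nl[length(nl)]
    }
  }
  if (current > 0) lengths <- c(lengths, current)  # unterminated final line
  lengths
}

# Small demonstration with a deliberately tiny chunk size
tf <- tempfile()
writeBin(charToRaw("abc\nde\n\nlast"), tf)
lens <- line_lengths(tf, chunk_size = 2)
```

On the real file, which.max() and max() of the returned vector would show which line is the long one and how many bytes it holds.
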
>>
>>
>>
>> making a huge single-line file does not reproduce the problem, i think the
>> embedded nuls have something to do with it--
>>
>>
>> # WARNING do not run with less than 64GB RAM
>> tf <- tempfile()
>> a <- rep( "a" , 1000000000 )
>> b <- paste( a , collapse = '' )
>> writeLines( b , tf ) ; rm( b ) ; gc()
>> d <- readLines( tf )
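
For the question of how to construct embedded nuls artificially: writeLines cannot emit them from a character vector, but writeBin on a raw vector can. The sketch below builds a tiny file this way and reads it with and without skipNul = TRUE. It only illustrates the construction at small scale. Whether scaling the same idea up to a multi-gigabyte line reproduces the segfault is untested here.

```r
# Build a two-line file whose first line contains an embedded NUL byte:
# "ab<NUL>cd\n" followed by "ef\n"
tf <- tempfile()
bytes <- c(charToRaw("ab"), as.raw(0), charToRaw("cd\n"), charToRaw("ef\n"))
writeBin(bytes, tf)

# With skipNul = TRUE the NUL is dropped and the rest of the line is kept
with_skip <- readLines(tf, skipNul = TRUE)

# Without skipNul, readLines warns about the embedded nul; capture the
# warning instead of printing it
warned <- FALSE
without_skip <- withCallingHandlers(
  readLines(tf),
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)
```

A larger test file could be produced by writing many raw chunks to a connection opened with file(tf, "wb"), inserting as.raw(0) at chosen offsets, which avoids holding the whole line in memory at once.
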
>>
>>
>>
>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.duncan at gmail.com>
>> wrote:
>>
>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>>
>>>> hello, the last line of the code below causes a segfault for me on 3.4.1.
>>>> i think i should submit to https://bugs.r-project.org/ unless others have
>>>> advice? thanks
>>>>
>>>
>>> Segfaults are usually worth reporting as bugs. Try to come up with a
>>> self-contained example, not using the lodown and archive packages. I
>>> imagine you can do this by uploading the file you downloaded, or enough of
>>> a subset of it to trigger the segfault. If you can't do that, then likely
>>> the bug is with one of those packages, not with R.
>>>
>>> Duncan Murdoch
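
One way to try the "enough of a subset" suggestion is to copy only the first part of the extracted file into a smaller file and test readLines on that. The sketch below assumes a hypothetical helper copy_head(). It copies raw bytes, so embedded nuls and line structure are preserved exactly, but there is no guarantee a truncated copy still triggers the crash.

```r
# Hypothetical helper: copy the first n_bytes of src into dest, in raw chunks.
copy_head <- function(src, dest, n_bytes, chunk_size = 1e6) {
  in_con <- file(src, "rb")
  on.exit(close(in_con), add = TRUE)
  out_con <- file(dest, "wb")
  on.exit(close(out_con), add = TRUE)
  remaining <- n_bytes
  while (remaining > 0) {
    chunk <- readBin(in_con, what = "raw", n = min(chunk_size, remaining))
    if (length(chunk) == 0) break           # source shorter than n_bytes
    writeBin(chunk, out_con)
    remaining <- remaining - length(chunk)
  }
  invisible(n_bytes - remaining)            # number of bytes actually copied
}

# Small demonstration: keep the first 30 bytes of a 100-byte file
src <- tempfile()
dest <- tempfile()
writeBin(as.raw(1:100), src)
copied <- copy_head(src, dest, n_bytes = 30, chunk_size = 8)
```

Halving n_bytes repeatedly until the problem disappears would give a rough idea of how much of the file is needed.
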
>>>
>>>
>>>>
>>>>
>>>>
>>>>
>>>> install.packages( "devtools" )
>>>> devtools::install_github("ajdamico/lodown")
>>>> devtools::install_github("jimhester/archive")
>>>>
>>>>
>>>> file_folder <- file.path( tempdir() , "file_folder" )
>>>>
>>>> tf <- tempfile()
>>>>
>>>> # large download! cachaca saves on your local disk if already downloaded
>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>>>>
>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>>
>>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
>>>>
>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>>
>>>> # works
>>>> R.utils::countLines( infile )
>>>>
>>>> # works with warning
>>>> my_file <- readLines( infile , skipNul = TRUE )
>>>>
>>>> # crash
>>>> my_file <- readLines( infile )
>>>>
>>>>
>>>> # run just before crash
>>>> sessionInfo()
>>>> # R version 3.4.1 (2017-06-30)
>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>> # Running under: Windows 10 x64 (build 15063)
>>>>
>>>> # Matrix products: default
>>>>
>>>> # locale:
>>>> # [1] LC_COLLATE=English_United States.1252
>>>> # [2] LC_CTYPE=English_United States.1252
>>>> # [3] LC_MONETARY=English_United States.1252
>>>> # [4] LC_NUMERIC=C
>>>> # [5] LC_TIME=English_United States.1252
>>>>
>>>> # attached base packages:
>>>> # [1] stats graphics grDevices utils datasets methods base
>>>>
>>>> # loaded via a namespace (and not attached):
>>>> # [1] httr_1.2.1 compiler_3.4.1 R6_2.2.1 withr_1.0.2
>>>> # [5] tibble_1.3.3 curl_2.6 Rcpp_0.12.11 memoise_1.1.0
>>>> # [9] R.methodsS3_1.7.1 git2r_0.18.0 digest_0.6.12 lodown_0.1.0
>>>> # [13] R.utils_2.5.0 rlang_0.1.1 devtools_1.13.2 R.oo_1.21.0
>>>> # [17] archive_0.0.0.9000
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>
>>
>
>
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>     Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k