[R] readLines without skipNul=TRUE causes crash
Duncan Murdoch
murdoch.duncan at gmail.com
Sat Jul 15 18:01:43 CEST 2017
On 15/07/2017 11:33 AM, Anthony Damico wrote:
> hi, i realized that the segfault happens on the text file in a new R
> session. so, creating the segfault-generating text file requires a
> contributed package, but prompting the actual segfault does not --
> pretty sure that means this is a base R bug? submitted here:
> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 hopefully i
> am not doing something remarkably stupid. the text file itself is 4GB
> so cannot upload it to bugzilla, and from the R_AllocStringBuffer error
> in the previous message, i think most or all of it needs to be there to
> trigger the segfault. thanks!
Hopefully someone can debug it with the info you provided.
Duncan Murdoch
>
> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdamico at gmail.com>
> wrote:
>
> hi, thanks Dr. Murdoch
>
>
> i'd appreciate if anyone on r-help could help me narrow this down?
> i believe the segfault occurs because there's a single line with 4GB
> and also embedded nuls, but i am not sure how to artificially
> construct that?
>
>
> the lodown package can be removed from my example.. it is just for
> file download caching, so `lodown::cachaca` can be replaced with
> `download.file`. my current example requires a huge download, so it is
> sort of painful to repeat, but i'm pretty confident that's not the issue.
>
>
> the archive::archive_extract() function unzips a (probably corrupt)
> .RAR file and creates a text file with 80,937 lines. this file is 4GB:
>
> > file.size(infile)
> [1] 4078192743
>
>
> i am pretty sure that nearly all of that 4GB is contained on a
> single line in the file. here's what happens when i create a file
> connection and scan through..
>
> > w <- file( infile , 'r' )
> >
> > first_80936_lines <- readLines( w , n = 80936 )
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "1000023930632009"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "36F2924009PAULO"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "AFONSO"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "BA11"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "00000"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "00"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "2924009PAULO"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "AFONSO"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "BA1111"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "467.20"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "346.10"
> > scan( w , n = 1 , what = character() )
> Read 1 item
> [1] "414.40"
> > scan( w , n = 1 , what = character() )
> Error in scan(w, n = 1, what = character()) :
> could not allocate memory (2048 Mb) in C function
> 'R_AllocStringBuffer'
>
>
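One way to test the guess about a single huge line with embedded nuls, without asking R to build a 4GB string, might be to scan the raw bytes in chunks. This is only a minimal sketch: `scan_bytes` and its `chunk_size` argument are hypothetical names, not part of base R, and it assumes lines end in `\n`.

```r
# Hypothetical helper (not part of base R): read a file in fixed-size raw
# chunks and report how many nul bytes it holds and how long its longest
# line is, in bytes. No character string is ever allocated.
scan_bytes <- function(path, chunk_size = 1e6) {
  con <- file(path, "rb")
  on.exit(close(con))
  n_nul <- 0     # doubles, so counts past 2^31 do not overflow
  longest <- 0
  current <- 0   # bytes seen since the last newline
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0) break
    n_nul <- n_nul + sum(chunk == as.raw(0))
    nl <- which(chunk == as.raw(10))   # positions of "\n" in this chunk
    if (length(nl) == 0) {
      current <- current + length(chunk)
    } else {
      first_len <- current + nl[1] - 1   # completes the carried-over line
      inner <- diff(nl) - 1              # lines fully inside this chunk
      longest <- max(longest, first_len, inner)
      current <- length(chunk) - nl[length(nl)]
    }
  }
  longest <- max(longest, current)       # unterminated final line, if any
  c(nul_bytes = n_nul, longest_line = longest)
}
```

Calling `scan_bytes( infile )` on the extracted file should show whether most of the bytes sit on one line and whether nuls are present.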
>
> making a huge single-line file does not reproduce the problem, i
> think the embedded nuls have something to do with it--
>
>
> # WARNING do not run with less than 64GB RAM
> tf <- tempfile()
> a <- rep( "a" , 1000000000 )
> b <- paste( a , collapse = '' )
> writeLines( b , tf ) ; rm( b ) ; gc()
> d <- readLines( tf )
>
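At a small scale, a file with an embedded nul can be written with `writeBin`, which shows the documented difference between the two `readLines` calls. This sketch only illustrates that behavior. It is not expected to reproduce the segfault, which seems to need the full-size file, and the variable names are illustrative.

```r
tf_small <- tempfile()

# one line of text with a nul byte in the middle
bytes <- c(charToRaw("abc"), as.raw(0), charToRaw("def\n"))
writeBin(bytes, tf_small)

# default: the embedded nul terminates the line being read (with a warning,
# suppressed here), so the text after the nul is lost
truncated <- suppressWarnings(readLines(tf_small))

# skipNul = TRUE: the nul byte is skipped and the rest of the line is kept
skipped <- readLines(tf_small, skipNul = TRUE)
```

Repeating the nul and the filler bytes would scale the same construction up toward the sizes discussed here, memory permitting.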
>
>
> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch
> <murdoch.duncan at gmail.com> wrote:
>
> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>
> hello, the last line of the code below causes a segfault for
> me on 3.4.1.
> i think i should submit to https://bugs.r-project.org/
> unless others have
> advice? thanks
>
>
> Segfaults are usually worth reporting as bugs. Try to come up
> with a self-contained example, not using the lodown and archive
> packages. I imagine you can do this by uploading the file you
> downloaded, or enough of a subset of it to trigger the
> segfault. If you can't do that, then likely the bug is with one
> of those packages, not with R.
>
> Duncan Murdoch
>
>
>
>
>
>
> install.packages( "devtools" )
> devtools::install_github("ajdamico/lodown")
> devtools::install_github("jimhester/archive")
>
>
> file_folder <- file.path( tempdir() , "file_folder" )
>
> tf <- tempfile()
>
> # large download! cachaca saves on your local disk if already downloaded
> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>
> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>
> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
>
> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>
> # works
> R.utils::countLines( infile )
>
> # works with warning
> my_file <- readLines( infile , skipNul = TRUE )
>
> # crash
> my_file <- readLines( infile )
>
>
> # run just before crash
> sessionInfo()
> # R version 3.4.1 (2017-06-30)
> # Platform: x86_64-w64-mingw32/x64 (64-bit)
> # Running under: Windows 10 x64 (build 15063)
>
> # Matrix products: default
>
> # locale:
> # [1] LC_COLLATE=English_United States.1252
> # [2] LC_CTYPE=English_United States.1252
> # [3] LC_MONETARY=English_United States.1252
> # [4] LC_NUMERIC=C
> # [5] LC_TIME=English_United States.1252
>
> # attached base packages:
> # [1] stats graphics grDevices utils datasets methods base
>
> # loaded via a namespace (and not attached):
> # [1] httr_1.2.1 compiler_3.4.1 R6_2.2.1 withr_1.0.2
> # [5] tibble_1.3.3 curl_2.6 Rcpp_0.12.11 memoise_1.1.0
> # [9] R.methodsS3_1.7.1 git2r_0.18.0 digest_0.6.12 lodown_0.1.0
> # [13] R.utils_2.5.0 rlang_0.1.1 devtools_1.13.2 R.oo_1.21.0
> # [17] archive_0.0.0.9000
>