[R] readLines without skipNul=TRUE causes crash

Duncan Murdoch murdoch.duncan at gmail.com
Sat Jul 15 18:01:43 CEST 2017


On 15/07/2017 11:33 AM, Anthony Damico wrote:
> hi, i realized that the segfault happens on the text file in a new R
> session.  so, creating the segfault-generating text file requires a
> contributed package, but prompting the actual segfault does not --
> pretty sure that means this is a base R bug?  submitted here:
> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i
> am not doing something remarkably stupid.  the text file itself is 4GB
> so cannot upload it to bugzilla, and from the R_AllocStringBuffer error
> in the previous message, i think most or all of it needs to be there to
> trigger the segfault.  thanks!

Hopefully someone can debug it with the info you provided.

Duncan Murdoch
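
For anyone trying to narrow this down without the 4GB download, here is a minimal sketch of how one might build a small file with embedded nuls and compare the two readLines() calls. The helper name count_nul_bytes and its chunk_size argument are made up for illustration. This small file is not expected to crash anything, and whether a scaled-up version reproduces the segfault is untested.

```r
# Build a small one-line file containing two embedded nul bytes.
tf <- tempfile(fileext = ".txt")
bytes <- c(charToRaw("abc"), as.raw(0),
           charToRaw("def"), as.raw(0),
           charToRaw("ghi\n"))
writeBin(bytes, tf)

# Hypothetical helper: count nul bytes by reading raw chunks, so the
# whole file is never held in memory at once.
count_nul_bytes <- function(path, chunk_size = 1e6) {
  con <- file(path, "rb")
  on.exit(close(con))
  total <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    total <- total + sum(chunk == as.raw(0))
  }
  total
}

n_nul <- count_nul_bytes(tf)

# skipNul = TRUE drops the nuls and keeps the rest of the line.
with_skip <- readLines(tf, skipNul = TRUE)

# The default (skipNul = FALSE) warns about the embedded nul;
# the warning is suppressed here only to keep the output quiet.
without_skip <- suppressWarnings(readLines(tf))
```

Running count_nul_bytes( infile ) on the real file would show whether it contains nuls at all, without going through readLines() or scan().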

>
> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdamico at gmail.com> wrote:
>
>     hi, thanks Dr. Murdoch
>
>
>     i'd appreciate if anyone on r-help could help me narrow this down?
>     i believe the segfault occurs because there's a single line with 4GB
>     and also embedded nuls, but i am not sure how to artificially
>     construct that?
>
>
>     the lodown package can be removed from my example..  it is just for
>     file download caching, so `lodown::cachaca` can be replaced with
>     `download.file`.  my current example requires a huge download, so
>     sort of painful to repeat but i'm pretty confident that's not the issue.
>
>
>     the archive::archive_extract() function unzips a (probably corrupt)
>     .RAR file and creates a text file with 80,937 lines.  this file is 4GB:
>
>         > file.size(infile)
>         [1] 4078192743
>
>
>     i am pretty sure that nearly all of that 4GB is contained on a
>     single line in the file.  here's what happens when i create a file
>     connection and scan through..
>
>         > w <- file( infile , 'r' )
>         >
>         > first_80936_lines <- readLines( w , n = 80936 )
>         > scan( w , n = 1 , what = character() )
>         Read 1 item
>         [1] "1000023930632009"
>         > scan( w , n = 1 , what = character() )
>         Read 1 item
>         [1] "36F2924009PAULO"
>         > scan( w , n = 1 , what = character() )
>         Read 1 item
>         [1] "AFONSO"
>         > scan( w , n = 1 , what = character() )
>         Read 1 item
>         [1] "BA11"
>         > scan( w , n = 1 , what = character() )
>         Read 1 item
>         [1] "00000"
>         > scan( w , n = 1 , what = character() )
>         Read 1 item
>         [1] "00"
>         > scan( w , n = 1 , what = character() )
>         Read 1 item
>         [1] "2924009PAULO"
>         > scan( w , n = 1 , what = character() )
>         Read 1 item
>         [1] "AFONSO"
>         > scan( w , n = 1 , what = character() )
>         Read 1 item
>         [1] "BA1111"
>         > scan( w , n = 1 , what = character() )
>         Read 1 item
>         [1] "467.20"
>         > scan( w , n = 1 , what = character() )
>         Read 1 item
>         [1] "346.10"
>         > scan( w , n = 1 , what = character() )
>         Read 1 item
>         [1] "414.40"
>         > scan( w , n = 1 , what = character() )
>         Error in scan(w, n = 1, what = character()) :
>           could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
>
>
>
>     making a huge single-line file does not reproduce the problem, i
>     think the embedded nuls have something to do with it--
>
>
>         # WARNING do not run with less than 64GB RAM
>         tf <- tempfile()
>         a <- rep( "a" , 1000000000 )
>         b <- paste( a , collapse = '' )
>         writeLines( b , tf ) ; rm( b ) ; gc()
>         d <- readLines( tf )
>
>
>
>     On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch
>     <murdoch.duncan at gmail.com> wrote:
>
>         On 15/07/2017 7:35 AM, Anthony Damico wrote:
>
>             hello, the last line of the code below causes a segfault for
>             me on 3.4.1.  i think i should submit to
>             https://bugs.r-project.org/ unless others have advice?  thanks
>
>
>         Segfaults are usually worth reporting as bugs.  Try to come up
>         with a self-contained example, not using the lodown and archive
>         packages.  I imagine you can do this by uploading the file you
>         downloaded, or enough of a subset of it to trigger the
>         segfault.  If you can't do that, then likely the bug is with one
>         of those packages, not with R.
>
>         Duncan Murdoch
>
>
>
>
>
>
>             install.packages( "devtools" )
>             devtools::install_github("ajdamico/lodown")
>             devtools::install_github("jimhester/archive")
>
>
>             file_folder <- file.path( tempdir() , "file_folder" )
>
>             tf <- tempfile()
>
>             # large download!  cachaca saves on your local disk if
>             # already downloaded
>             lodown::cachaca(
>               'http://download.inep.gov.br/microdados/microdados_enem2009.rar' ,
>               tf , mode = 'wb' )
>
>             archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>
>             unzipped_files <- list.files( file_folder , recursive = TRUE ,
>               full.names = TRUE )
>
>             infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>
>             # works
>             R.utils::countLines( infile )
>
>             # works with warning
>             my_file <- readLines( infile , skipNul = TRUE )
>
>             # crash
>             my_file <- readLines( infile )
>
>
>             # run just before crash
>             sessionInfo()
>             # R version 3.4.1 (2017-06-30)
>             # Platform: x86_64-w64-mingw32/x64 (64-bit)
>             # Running under: Windows 10 x64 (build 15063)
>
>             # Matrix products: default
>
>             # locale:
>             # [1] LC_COLLATE=English_United States.1252
>             # [2] LC_CTYPE=English_United States.1252
>             # [3] LC_MONETARY=English_United States.1252
>             # [4] LC_NUMERIC=C
>             # [5] LC_TIME=English_United States.1252
>
>             # attached base packages:
>             # [1] stats     graphics  grDevices utils     datasets  methods   base
>
>             # loaded via a namespace (and not attached):
>             #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>             #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>             #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>             # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>             # [17] archive_0.0.0.9000
>


