[R] readLines without skipNul=TRUE causes crash

Anthony Damico ajdamico at gmail.com
Sun Jul 16 18:37:17 CEST 2017


hi, yep, there are two problems -- but i think only the segfault is within
the scope of a base R issue?  i need to look closer at the corrupted
decompression and figure out whether i should talk to the brazilian
government agency that creates that .rar file or open an issue with the
archive package maintainer.  my goal in this thread is only to figure out
how to replicate the goofy text file so the r team can turn it into an
error instead of a segfault.
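
for anyone trying to build a small stand-in, here is a minimal sketch of how a line with an embedded nul behaves under readLines() with and without skipNul.  this is not the ENEM file and is far too small to trigger the segfault; the file contents and the variable names (tf, truncated, skipped) are made up for illustration.

```r
# build a tiny file whose first line contains one embedded nul byte
# (hypothetical contents, only meant to show the skipNul difference)
tf <- tempfile( fileext = ".txt" )
con <- file( tf , "wb" )
writeBin( c( charToRaw( "abc" ) , as.raw( 0 ) , charToRaw( "def\nsecond line\n" ) ) , con )
close( con )

# default behavior: the line is cut off at the nul and a warning is raised
truncated <- suppressWarnings( readLines( tf ) )

# skipNul = TRUE: the nul byte is dropped and the rest of the line is kept
skipped <- readLines( tf , skipNul = TRUE )

# compare the length of the first line under each setting
nchar( truncated[ 1 ] )
nchar( skipped[ 1 ] )
```

the real file presumably differs in that the nuls sit inside one multi-gigabyte line, which this sketch does not attempt to reproduce.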

the original example i sent stores the .txt file somewhere inside the
tempdir(), but when i copy it over elsewhere on my machine, the md5sum()
gives the same result.  thanks again for looking at this

    > tools::md5sum(infile)
    C:\\Users\\AnthonyD\\AppData\\Local\\Temp\\RtmpIBy7qt/file_folder/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
    "30beb57419486108e98d42ec7a2f8b19"


    > tools::md5sum( "S:/temp/crash.txt" )
                     S:/temp/crash.txt
    "30beb57419486108e98d42ec7a2f8b19"

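a minimal sketch of that same check, using a small temporary file as a stand-in for the 4GB one.  the paths and the names src, dst, sums and same_bytes are assumptions for illustration only:

```r
# write a small stand-in file and copy it somewhere else
src <- tempfile( fileext = ".txt" )
writeLines( c( "line one" , "line two" ) , src )
dst <- tempfile( fileext = ".txt" )
file.copy( src , dst )

# md5sum() returns a character vector named by file path,
# so drop the names before comparing the two checksums
sums <- tools::md5sum( c( src , dst ) )
same_bytes <- unname( sums[ 1 ] == sums[ 2 ] )
same_bytes
```

if the checksum on another machine matches the one above, the corrupted file has been reproduced byte for byte.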



On Sun, Jul 16, 2017 at 10:10 AM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
wrote:

> So you are saying there are two problems... one that produces a corrupt
> file from a valid compressed file, and one that segfaults when presented
> with that corrupt file? Can you please confirm the file name and run md5sum
> on it and share the result so we can tell when the file problem has been
> reproduced?
> --
> Sent from my phone. Please excuse my brevity.
>
> On July 16, 2017 3:21:21 AM PDT, Anthony Damico <ajdamico at gmail.com>
> wrote:
> >hi, thank you for attempting this. it looks like your unix machine
> >unzipped the txt file without corruption -- if you copied over the same
> >txt file to windows 7, i don't think that would reproduce the problem?
> >i think it needs to be the corrupted text file where
> >R.utils::countLines( txtfile ) gives 809367.  i am able to reproduce on
> >two distinct windows machines but no guarantee i'm not doing something
> >dumb
> >
> >On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller
> ><jdnewmil at dcn.davis.ca.us>
> >wrote:
> >
> >> I am not able to reproduce your segfault on a Windows 7 platform either:
> >>
> >> ##########################
> >> fn1 <- "d:/DADOS_ENEM_2009.txt"
> >> sessionInfo()
> >> ## R version 3.4.1 (2017-06-30)
> >> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
> >> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
> >> ##
> >> ## Matrix products: default
> >> ##
> >> ## locale:
> >> ## [1] LC_COLLATE=English_United States.1252
> >> ## [2] LC_CTYPE=English_United States.1252
> >> ## [3] LC_MONETARY=English_United States.1252
> >> ## [4] LC_NUMERIC=C
> >> ## [5] LC_TIME=English_United States.1252
> >> ##
> >> ## attached base packages:
> >> ## [1] stats     graphics  grDevices utils     datasets  methods   base
> >> ##
> >> ## loaded via a namespace (and not attached):
> >> ## [1] compiler_3.4.1
> >> tools::md5sum( fn1 )
> >> ##             d:/DADOS_ENEM_2009.txt
> >> ## "83e61c96092285b60d7bf6b0dbc7072e"
> >> dat <- readLines( fn1 )
> >> length( dat )
> >> ## [1] 4148721
> >>
> >>
> >> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
> >>
> >> I am not able to reproduce this on a Linux platform:
> >>>
> >>> #######################3
> >>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
> >>> sessionInfo()
> >>> ## R version 3.4.1 (2017-06-30)
> >>> ## Platform: x86_64-pc-linux-gnu (64-bit)
> >>> ## Running under: Ubuntu 14.04.5 LTS
> >>> ##
> >>> ## Matrix products: default
> >>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
> >>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
> >>> ##
> >>> ## locale:
> >>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> >>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> >>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >>> ##
> >>> ## attached base packages:
> >>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
> >>> ##
> >>> ## loaded via a namespace (and not attached):
> >>> ## [1] compiler_3.4.1
> >>> tools::md5sum( fn1 )
> >>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
> >>> ## "83e61c96092285b60d7bf6b0dbc7072e"
> >>> dat <- readLines( fn1 )
> >>> length( dat )
> >>> ## [1] 4148721
> >>>
> >>> No segfault occurs.
> >>>
> >>> On Sat, 15 Jul 2017, Anthony Damico wrote:
> >>>
> >>>> hi, i realized that the segfault happens on the text file in a new R
> >>>> session.  so, creating the segfault-generating text file requires a
> >>>> contributed package, but prompting the actual segfault does not --
> >>>> pretty sure that means this is a base R bug?  submitted here:
> >>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311
> >>>> hopefully i am not doing something remarkably stupid.  the text file
> >>>> itself is 4GB so cannot upload it to bugzilla, and from the
> >>>> R_AllocStringBuffer error in the previous message, i think most or all
> >>>> of it needs to be there to trigger the segfault.  thanks!
> >>>>
> >>>>
> >>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico
> ><ajdamico at gmail.com>
> >>>> wrote:
> >>>>
> >>>> hi, thanks Dr. Murdoch
> >>>>>
> >>>>>
> >>>>> i'd appreciate if anyone on r-help could help me narrow this down?  i
> >>>>> believe the segfault occurs because there's a single line with 4GB and
> >>>>> also embedded nuls, but i am not sure how to artificially construct
> >>>>> that?
> >>>>>
> >>>>>
> >>>>> the lodown package can be removed from my example..  it is just for
> >>>>> file download caching, so `lodown::cachaca` can be replaced with
> >>>>> `download.file`  my current example requires a huge download, so sort
> >>>>> of painful to repeat but i'm pretty confident that's not the issue.
> >>>>>
> >>>>>
> >>>>> the archive::archive_extract() function unzips a (probably corrupt)
> >>>>> .RAR file and creates a text file with 80,937 lines.  this file is 4GB:
> >>>>>
> >>>>>    > file.size(infile)
> >>>>>     [1] 4078192743
> >>>>>
> >>>>>
> >>>>> i am pretty sure that nearly all of that 4GB is contained on a single
> >>>>> line in the file.  here's what happens when i create a file connection
> >>>>> and scan through..
> >>>>>
> >>>>>    > w <- file( infile , 'r' )
> >>>>>    >
> >>>>>    > first_80936_lines <- readLines( w , n = 80936 )
> >>>>>    > scan( w , n = 1 , what = character() )
> >>>>>     Read 1 item
> >>>>>     [1] "1000023930632009"
> >>>>>    > scan( w , n = 1 , what = character() )
> >>>>>     Read 1 item
> >>>>>     [1] "36F2924009PAULO"
> >>>>>    > scan( w , n = 1 , what = character() )
> >>>>>     Read 1 item
> >>>>>     [1] "AFONSO"
> >>>>>    > scan( w , n = 1 , what = character() )
> >>>>>     Read 1 item
> >>>>>     [1] "BA11"
> >>>>>    > scan( w , n = 1 , what = character() )
> >>>>>     Read 1 item
> >>>>>     [1] "00000"
> >>>>>    > scan( w , n = 1 , what = character() )
> >>>>>     Read 1 item
> >>>>>     [1] "00"
> >>>>>    > scan( w , n = 1 , what = character() )
> >>>>>     Read 1 item
> >>>>>     [1] "2924009PAULO"
> >>>>>    > scan( w , n = 1 , what = character() )
> >>>>>     Read 1 item
> >>>>>     [1] "AFONSO"
> >>>>>    > scan( w , n = 1 , what = character() )
> >>>>>     Read 1 item
> >>>>>     [1] "BA1111"
> >>>>>    > scan( w , n = 1 , what = character() )
> >>>>>     Read 1 item
> >>>>>     [1] "467.20"
> >>>>>    > scan( w , n = 1 , what = character() )
> >>>>>     Read 1 item
> >>>>>     [1] "346.10"
> >>>>>    > scan( w , n = 1 , what = character() )
> >>>>>     Read 1 item
> >>>>>     [1] "414.40"
> >>>>>    > scan( w , n = 1 , what = character() )
> >>>>>     Error in scan(w, n = 1, what = character()) :
> >>>>>       could not allocate memory (2048 Mb) in C function
> >>>>> 'R_AllocStringBuffer'
> >>>>>
> >>>>>
> >>>>>
> >>>>> making a huge single-line file does not reproduce the problem, i think
> >>>>> the embedded nuls have something to do with it--
> >>>>>
> >>>>>
> >>>>>     # WARNING do not run with less than 64GB RAM
> >>>>>     tf <- tempfile()
> >>>>>     a <- rep( "a" , 1000000000 )
> >>>>>     b <- paste( a , collapse = '' )
> >>>>>     writeLines( b , tf ) ; rm( b ) ; gc()
> >>>>>     d <- readLines( tf )
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <
> >>>>> murdoch.duncan at gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
> >>>>>>
> >>>>>>> hello, the last line of the code below causes a segfault for me on
> >>>>>>> 3.4.1.  i think i should submit to https://bugs.r-project.org/
> >>>>>>> unless others have advice?  thanks
> >>>>>>>
> >>>>>>>
> >>>>>> Segfaults are usually worth reporting as bugs.  Try to come up with a
> >>>>>> self-contained example, not using the lodown and archive packages.  I
> >>>>>> imagine you can do this by uploading the file you downloaded, or
> >>>>>> enough of a subset of it to trigger the segfault.  If you can't do
> >>>>>> that, then likely the bug is with one of those packages, not with R.
> >>>>>>
> >>>>>> Duncan Murdoch
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> install.packages( "devtools" )
> >>>>>>> devtools::install_github("ajdamico/lodown")
> >>>>>>> devtools::install_github("jimhester/archive")
> >>>>>>>
> >>>>>>>
> >>>>>>> file_folder <- file.path( tempdir() , "file_folder" )
> >>>>>>>
> >>>>>>> tf <- tempfile()
> >>>>>>>
> >>>>>>> # large download!  cachaca saves on your local disk if already downloaded
> >>>>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
> >>>>>>>
> >>>>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
> >>>>>>>
> >>>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
> >>>>>>>
> >>>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
> >>>>>>>
> >>>>>>> # works
> >>>>>>> R.utils::countLines( infile )
> >>>>>>>
> >>>>>>> # works with warning
> >>>>>>> my_file <- readLines( infile , skipNul = TRUE )
> >>>>>>>
> >>>>>>> # crash
> >>>>>>> my_file <- readLines( infile )
> >>>>>>>
> >>>>>>>
> >>>>>>> # run just before crash
> >>>>>>> sessionInfo()
> >>>>>>> # R version 3.4.1 (2017-06-30)
> >>>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
> >>>>>>> # Running under: Windows 10 x64 (build 15063)
> >>>>>>>
> >>>>>>> # Matrix products: default
> >>>>>>>
> >>>>>>> # locale:
> >>>>>>> # [1] LC_COLLATE=English_United States.1252
> >>>>>>> # [2] LC_CTYPE=English_United States.1252
> >>>>>>> # [3] LC_MONETARY=English_United States.1252
> >>>>>>> # [4] LC_NUMERIC=C
> >>>>>>> # [5] LC_TIME=English_United States.1252
> >>>>>>>
> >>>>>>> # attached base packages:
> >>>>>>> # [1] stats     graphics  grDevices utils     datasets  methods   base
> >>>>>>>
> >>>>>>> # loaded via a namespace (and not attached):
> >>>>>>> #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
> >>>>>>> #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
> >>>>>>> #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
> >>>>>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
> >>>>>>> # [17] archive_0.0.0.9000
> >>>>>>>
> >>>>>>> ______________________________________________
> >>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>> ---------------------------------------------------------------------------
> >>> Jeff Newmiller                        The     .....       .....  Go Live...
> >>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
> >>>                                       Live:   OO#.. Dead: OO#..  Playing
> >>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> >>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> >>>
>
