[R] readLines without skipNul=TRUE causes crash

Anthony Damico ajdamico at gmail.com
Mon Jul 17 14:00:48 CEST 2017


hi, thanks again for taking the time.  since corrupted compression prompted
the segfault for me in the first place, i've just posted the text file
as-is.  it's a 2.4GB file so to be avoided on a metered internet
connection.  i've updated the bugzilla report at
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 with more
relevant info.  these lines of code crash both windows R 3.4.1 and also
linux R 3.3.3 for me.  thanks again


    # consider changing `tempfile()` to a permanent location
    # so you don't lose the large downloaded file after the crash
    tf <- tempfile()
    download.file( "https://sisyphus.project.cwi.nl/r-bug-17311-crash.txt" ,
                   tf , mode = 'wb' )
    sessionInfo()
    x <- readLines( tf )
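
for anyone who wants to see how readLines treats an embedded nul without the large download, here is a minimal sketch on a tiny file.  it is not expected to reproduce the segfault, which seems to need the full-size file.  the file contents and the names `tf_small`, `nul_count`, `clean_lines` and `default_lines` are made up for illustration and do not come from the bug report.

```r
# write a tiny two-line file whose first line contains one embedded NUL byte
# (illustrative contents only, not taken from the ENEM file)
tf_small <- tempfile()
bytes <- c( charToRaw( "abc" ) , as.raw( 0 ) , charToRaw( "def\nsecond line\n" ) )
writeBin( bytes , tf_small )

# count NUL bytes directly from the raw contents, bypassing readLines
raw_contents <- readBin( tf_small , what = "raw" , n = file.size( tf_small ) )
nul_count <- sum( raw_contents == as.raw( 0 ) )

# skipNul = TRUE drops the NUL bytes and keeps the rest of each line
clean_lines <- readLines( tf_small , skipNul = TRUE )

# the default ( skipNul = FALSE ) is expected to warn about the embedded nul
# on a small file; capture the warning message instead of printing it
default_lines <- tryCatch(
    readLines( tf_small ) ,
    warning = function( w ) conditionMessage( w )
)

print( nul_count )
print( clean_lines )
print( default_lines )
```

reading the raw bytes in one call is only sensible for small files; a multi-gigabyte file would need to be scanned in chunks through a binary connection.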




On Sun, Jul 16, 2017 at 2:22 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
wrote:

> I am stuck. The archive package won't compile for me on Ubuntu, and the
> CRANextra repo seems to be down so I cannot install packages on Windows
> right now. Perhaps you can zip the corrupt text file and put it online
> somewhere? Don't use the archive package to pack it since there seem to be
> issues with that tool on your machine.
>
> I would discourage you from harassing the Brazilian government about their
> RAR file because the RAR file seems fine (no NUL characters appear in the
> text file) when extracted using the file-roller archive tool on Ubuntu.
> --
> Sent from my phone. Please excuse my brevity.
>
> On July 16, 2017 9:37:17 AM PDT, Anthony Damico <ajdamico at gmail.com>
> wrote:
> >hi, yep, there are two problems -- but i think only the segfault is within
> >the scope of a base R issue?  i need to look closer at the corrupted
> >decompression and figure out whether i should talk to the brazilian
> >government agency that creates that .rar file or open an issue with the
> >archive package maintainer.  my goal in this thread is only to figure out
> >how to replicate the goofy text file so the r team can turn it into an
> >error instead of a segfault.
> >
> >the original example i sent stores the .txt file somewhere inside the
> >tempdir(), but when i copy it over elsewhere on my machine, the md5sum()
> >gives the same result.  thanks again for looking at this
> >
> >    > tools::md5sum(infile)
> >
> >C:\\Users\\AnthonyD\\AppData\\Local\\Temp\\RtmpIBy7qt/file_folder/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
> >    "30beb57419486108e98d42ec7a2f8b19"
> >
> >
> >    > tools::md5sum( "S:/temp/crash.txt" )
> >                     S:/temp/crash.txt
> >    "30beb57419486108e98d42ec7a2f8b19"
> >
> >
> >
> >
> >On Sun, Jul 16, 2017 at 10:10 AM, Jeff Newmiller
> ><jdnewmil at dcn.davis.ca.us>
> >wrote:
> >
> >> So you are saying there are two problems... one that produces a corrupt
> >> file from a valid compressed file, and one that segfaults when presented
> >> with that corrupt file? Can you please confirm the file name and run md5sum
> >> on it and share the result so we can tell when the file problem has been
> >> reproduced?
> >> --
> >> Sent from my phone. Please excuse my brevity.
> >>
> >> On July 16, 2017 3:21:21 AM PDT, Anthony Damico <ajdamico at gmail.com>
> >> wrote:
> >> >hi, thank you for attempting this. it looks like your unix machine unzipped
> >> >the txt file without corruption -- if you copied over the same txt file to
> >> >windows 7, i don't think that would reproduce the problem?  i think it
> >> >needs to be the corrupted text file where   R.utils::countLines( txtfile )
> >> >gives 809367.  i am able to reproduce on two distinct windows machines
> >> >but no guarantee i'm not doing something dumb
> >> >
> >> >On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller
> >> ><jdnewmil at dcn.davis.ca.us>
> >> >wrote:
> >> >
> >> >> I am not able to reproduce your segfault on a Windows 7 platform either:
> >> >>
> >> >> ##########################
> >> >> fn1 <- "d:/DADOS_ENEM_2009.txt"
> >> >> sessionInfo()
> >> >> ## R version 3.4.1 (2017-06-30)
> >> >> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
> >> >> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
> >> >> ##
> >> >> ## Matrix products: default
> >> >> ##
> >> >> ## locale:
> >> >> ## [1] LC_COLLATE=English_United States.1252
> >> >> ## [2] LC_CTYPE=English_United States.1252
> >> >> ## [3] LC_MONETARY=English_United States.1252
> >> >> ## [4] LC_NUMERIC=C
> >> >> ## [5] LC_TIME=English_United States.1252
> >> >> ##
> >> >> ## attached base packages:
> >> >> ## [1] stats     graphics  grDevices utils     datasets  methods   base
> >> >> ##
> >> >> ## loaded via a namespace (and not attached):
> >> >> ## [1] compiler_3.4.1
> >> >> tools::md5sum( fn1 )
> >> >> ##             d:/DADOS_ENEM_2009.txt
> >> >> ## "83e61c96092285b60d7bf6b0dbc7072e"
> >> >> dat <- readLines( fn1 )
> >> >> length( dat )
> >> >> ## [1] 4148721
> >> >>
> >> >>
> >> >> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
> >> >>
> >> >> I am not able to reproduce this on a Linux platform:
> >> >>>
> >> >>> #######################
> >> >>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
> >> >>> sessionInfo()
> >> >>> ## R version 3.4.1 (2017-06-30)
> >> >>> ## Platform: x86_64-pc-linux-gnu (64-bit)
> >> >>> ## Running under: Ubuntu 14.04.5 LTS
> >> >>> ##
> >> >>> ## Matrix products: default
> >> >>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
> >> >>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
> >> >>> ##
> >> >>> ## locale:
> >> >>> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >> >>> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >> >>> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> >> >>> ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >> >>> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> >> >>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >> >>> ##
> >> >>> ## attached base packages:
> >> >>> ## [1] stats     graphics  grDevices utils     datasets  methods   base
> >> >>> ##
> >> >>> ## loaded via a namespace (and not attached):
> >> >>> ## [1] compiler_3.4.1
> >> >>> tools::md5sum( fn1 )
> >> >>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
> >> >>> ## "83e61c96092285b60d7bf6b0dbc7072e"
> >> >>> dat <- readLines( fn1 )
> >> >>> length( dat )
> >> >>> ## [1] 4148721
> >> >>>
> >> >>> No segfault occurs.
> >> >>>
> >> >>> On Sat, 15 Jul 2017, Anthony Damico wrote:
> >> >>>
> >> >>>> hi, i realized that the segfault happens on the text file in a new R
> >> >>>> session.  so, creating the segfault-generating text file requires a
> >> >>>> contributed package, but prompting the actual segfault does not -- pretty
> >> >>>> sure that means this is a base R bug?  submitted here:
> >> >>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i am
> >> >>>> not doing something remarkably stupid.  the text file itself is 4GB so
> >> >>>> cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
> >> >>>> previous message, i think most or all of it needs to be there to trigger
> >> >>>> the segfault.  thanks!
> >> >>>>
> >> >>>>
> >> >>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico
> >> ><ajdamico at gmail.com>
> >> >>>> wrote:
> >> >>>>
> >> >>>> hi, thanks Dr. Murdoch
> >> >>>>>
> >> >>>>>
> >> >>>>> i'd appreciate if anyone on r-help could help me narrow this down?  i
> >> >>>>> believe the segfault occurs because there's a single line with 4GB and also
> >> >>>>> embedded nuls, but i am not sure how to artificially construct that?
> >> >>>>>
> >> >>>>>
> >> >>>>> the lodown package can be removed from my example..  it is just for file
> >> >>>>> download caching, so `lodown::cachaca` can be replaced with
> >> >>>>> `download.file`.  my current example requires a huge download, so sort of
> >> >>>>> painful to repeat but i'm pretty confident that's not the issue.
> >> >>>>>
> >> >>>>>
> >> >>>>> the archive::archive_extract() function unzips a (probably corrupt) .RAR
> >> >>>>> file and creates a text file with 80,937 lines.  this file is 4GB:
> >> >>>>>
> >> >>>>>    > file.size(infile)
> >> >>>>>     [1] 4078192743
> >> >>>>>
> >> >>>>>
> >> >>>>> i am pretty sure that nearly all of that 4GB is contained on a single line
> >> >>>>> in the file.  here's what happens when i create a file connection and scan
> >> >>>>> through..
> >> >>>>>
> >> >>>>>    > file_con <- file( infile , 'r' )
> >> >>>>>    >
> >> >>>>>    > first_80936_lines <- readLines( file_con , n = 80936 )
> >> >>>>>    > scan( w , n = 1 , what = character() )
> >> >>>>>     Read 1 item
> >> >>>>>     [1] "1000023930632009"
> >> >>>>>    > scan( w , n = 1 , what = character() )
> >> >>>>>     Read 1 item
> >> >>>>>     [1] "36F2924009PAULO"
> >> >>>>>    > scan( w , n = 1 , what = character() )
> >> >>>>>     Read 1 item
> >> >>>>>     [1] "AFONSO"
> >> >>>>>    > scan( w , n = 1 , what = character() )
> >> >>>>>     Read 1 item
> >> >>>>>     [1] "BA11"
> >> >>>>>    > scan( w , n = 1 , what = character() )
> >> >>>>>     Read 1 item
> >> >>>>>     [1] "00000"
> >> >>>>>    > scan( w , n = 1 , what = character() )
> >> >>>>>     Read 1 item
> >> >>>>>     [1] "00"
> >> >>>>>    > scan( w , n = 1 , what = character() )
> >> >>>>>     Read 1 item
> >> >>>>>     [1] "2924009PAULO"
> >> >>>>>    > scan( w , n = 1 , what = character() )
> >> >>>>>     Read 1 item
> >> >>>>>     [1] "AFONSO"
> >> >>>>>    > scan( w , n = 1 , what = character() )
> >> >>>>>     Read 1 item
> >> >>>>>     [1] "BA1111"
> >> >>>>>    > scan( w , n = 1 , what = character() )
> >> >>>>>     Read 1 item
> >> >>>>>     [1] "467.20"
> >> >>>>>    > scan( w , n = 1 , what = character() )
> >> >>>>>     Read 1 item
> >> >>>>>     [1] "346.10"
> >> >>>>>    > scan( w , n = 1 , what = character() )
> >> >>>>>     Read 1 item
> >> >>>>>     [1] "414.40"
> >> >>>>>    > scan( w , n = 1 , what = character() )
> >> >>>>>     Error in scan(w, n = 1, what = character()) :
> >> >>>>>       could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> making a huge single-line file does not reproduce the problem, i think the
> >> >>>>> embedded nuls have something to do with it--
> >> >>>>>
> >> >>>>>
> >> >>>>>     # WARNING do not run with less than 64GB RAM
> >> >>>>>     tf <- tempfile()
> >> >>>>>     a <- rep( "a" , 1000000000 )
> >> >>>>>     b <- paste( a , collapse = '' )
> >> >>>>>     writeLines( b , tf ) ; rm( b ) ; gc()
> >> >>>>>     d <- readLines( tf )
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <
> >> >>>>> murdoch.duncan at gmail.com>
> >> >>>>> wrote:
> >> >>>>>
> >> >>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
> >> >>>>>>
> >> >>>>>>> hello, the last line of the code below causes a segfault for me on 3.4.1.
> >> >>>>>>> i think i should submit to https://bugs.r-project.org/ unless others have
> >> >>>>>>> advice?  thanks
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>> Segfaults are usually worth reporting as bugs.  Try to come up with a
> >> >>>>>> self-contained example, not using the lodown and archive packages.  I
> >> >>>>>> imagine you can do this by uploading the file you downloaded, or enough of
> >> >>>>>> a subset of it to trigger the segfault.  If you can't do that, then likely
> >> >>>>>> the bug is with one of those packages, not with R.
> >> >>>>>>
> >> >>>>>> Duncan Murdoch
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> install.packages( "devtools" )
> >> >>>>>>> devtools::install_github("ajdamico/lodown")
> >> >>>>>>> devtools::install_github("jimhester/archive")
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> file_folder <- file.path( tempdir() , "file_folder" )
> >> >>>>>>>
> >> >>>>>>> tf <- tempfile()
> >> >>>>>>>
> >> >>>>>>> # large download!  cachaca saves on your local disk if already downloaded
> >> >>>>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
> >> >>>>>>>
> >> >>>>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
> >> >>>>>>>
> >> >>>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE  )
> >> >>>>>>>
> >> >>>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
> >> >>>>>>>
> >> >>>>>>> # works
> >> >>>>>>> R.utils::countLines( infile )
> >> >>>>>>>
> >> >>>>>>> # works with warning
> >> >>>>>>> my_file <- readLines( infile , skipNul = TRUE )
> >> >>>>>>>
> >> >>>>>>> # crash
> >> >>>>>>> my_file <- readLines( infile )
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> # run just before crash
> >> >>>>>>> sessionInfo()
> >> >>>>>>> # R version 3.4.1 (2017-06-30)
> >> >>>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
> >> >>>>>>> # Running under: Windows 10 x64 (build 15063)
> >> >>>>>>>
> >> >>>>>>> # Matrix products: default
> >> >>>>>>>
> >> >>>>>>> # locale:
> >> >>>>>>> # [1] LC_COLLATE=English_United States.1252
> >> >>>>>>> # [2] LC_CTYPE=English_United States.1252
> >> >>>>>>> # [3] LC_MONETARY=English_United States.1252
> >> >>>>>>> # [4] LC_NUMERIC=C
> >> >>>>>>> # [5] LC_TIME=English_United States.1252
> >> >>>>>>>
> >> >>>>>>> # attached base packages:
> >> >>>>>>> # [1] stats     graphics  grDevices utils     datasets  methods   base
> >> >>>>>>>
> >> >>>>>>> # loaded via a namespace (and not attached):
> >> >>>>>>> #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
> >> >>>>>>> #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
> >> >>>>>>> #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
> >> >>>>>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
> >> >>>>>>> # [17] archive_0.0.0.9000
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> ______________________________________________
> >> >>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> >>>>>>> and provide commented, minimal, self-contained, reproducible code.
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>> ---------------------------------------------------------------------------
> >> >>> Jeff Newmiller                        The     .....       .....  Go Live...
> >> >>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
> >> >>>                                       Live:   OO#.. Dead: OO#..  Playing
> >> >>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> >> >>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> >> >>>
> >> >>
> >>
>
