[R] readLines without skipNul=TRUE causes crash
Anthony Damico
ajdamico at gmail.com
Sat Jul 15 16:32:32 CEST 2017
hi, thanks Dr. Murdoch
could anyone on r-help help me narrow this down? i believe the segfault
occurs because there's a single line that is about 4GB long and also
contains embedded nuls, but i am not sure how to construct that artificially.
the lodown package can be removed from my example. it is only there for file
download caching, so `lodown::cachaca` can be replaced with
`download.file`. my current example requires a huge download, so it is sort of
painful to repeat, but i'm pretty confident the download is not the issue.
the archive::archive_extract() function extracts a (probably corrupt) .RAR
file and creates a text file with 80,937 lines. this file is 4GB:
> file.size(infile)
[1] 4078192743
i am pretty sure that nearly all of that 4GB is contained on a single line
in the file. here's what happens when i create a file connection and scan
through..
> w <- file( infile , 'r' )
>
> first_80936_lines <- readLines( w , n = 80936 )
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "1000023930632009"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "36F2924009PAULO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA11"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "00000"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "00"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "2924009PAULO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "AFONSO"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "BA1111"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "467.20"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "346.10"
> scan( w , n = 1 , what = character() )
Read 1 item
[1] "414.40"
> scan( w , n = 1 , what = character() )
Error in scan(w, n = 1, what = character()) :
could not allocate memory (2048 Mb) in C function
'R_AllocStringBuffer'
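one way to check the "single huge line with nuls" guess without asking R to build a 2GB+ string is to read the file as raw bytes in chunks and only count things. this is a rough sketch, not something from my original run: `summarize_bytes` and its `chunk_size` argument are hypothetical names, and it assumes lines end in "\n" (byte 10).

```r
# hypothetical helper: scan a file in raw chunks and report how many nul
# bytes it contains and how long (in bytes) its longest line is
summarize_bytes <- function(path, chunk_size = 1e6) {
  con <- file(path, "rb")
  on.exit(close(con))
  n_nul <- 0     # running count of nul bytes (double, so no integer overflow)
  longest <- 0   # longest completed line seen so far
  current <- 0   # bytes of the line that is still open at the chunk boundary
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    n_nul <- n_nul + sum(chunk == as.raw(0))
    nl <- which(chunk == as.raw(10))   # positions of "\n" in this chunk
    if (length(nl) == 0L) {
      current <- current + length(chunk)
    } else {
      # lengths of the lines that end inside this chunk, newline excluded
      line_lengths <- diff(c(0, nl)) - 1
      line_lengths[1] <- line_lengths[1] + current
      longest <- max(longest, line_lengths)
      current <- length(chunk) - nl[length(nl)]
    }
  }
  c(nul_bytes = n_nul, longest_line = max(longest, current))
}

# usage on the extracted file would be: summarize_bytes(infile)
```

memory use stays around `chunk_size` bytes per step, so this should be safe to run on the 4GB file even where `scan` and `readLines` fail.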
making a huge single-line file without nuls does not reproduce the problem,
so i think the embedded nuls have something to do with it:
# WARNING do not run with less than 64GB RAM
tf <- tempfile()
a <- rep( "a" , 1000000000 )
b <- paste( a , collapse = '' )
writeLines( b , tf ) ; rm( b ) ; gc()
d <- readLines( tf )
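for the embedded-nul part, a small-scale sketch of how such a file could be built is below. it writes raw bytes with `writeBin`, since nuls cannot live inside an R character string. whether scaling the nul-containing line up to several GB actually triggers the segfault is an assumption i have not verified.

```r
# small file whose second line has three embedded nuls between "abc" and "def"
tf <- tempfile()
con <- file(tf, "wb")
writeBin(charToRaw("first line\n"), con)
# as.raw(0) is the nul byte
writeBin(c(charToRaw("abc"), as.raw(c(0, 0, 0)), charToRaw("def\n")), con)
close(con)

# with skipNul = TRUE the nuls are dropped and the rest of the line is kept
with_skip <- readLines(tf, skipNul = TRUE)

# without skipNul, ?readLines says an embedded nul terminates the line being
# read, with a warning; the warning is suppressed here to keep output quiet
without_skip <- suppressWarnings(readLines(tf))

print(with_skip)
print(without_skip)
```

to get closer to the real file, the middle `writeBin` call could be repeated in a loop so the nul-containing line grows past 2GB before the final "\n" is written.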
On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:
> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>
>> hello, the last line of the code below causes a segfault for me on 3.4.1.
>> i think i should submit to https://bugs.r-project.org/ unless others have
>> advice? thanks
>>
>
> Segfaults are usually worth reporting as bugs. Try to come up with a
> self-contained example, not using the lodown and archive packages. I
> imagine you can do this by uploading the file you downloaded, or enough of
> a subset of it to trigger the segfault. If you can't do that, then likely
> the bug is with one of those packages, not with R.
>
> Duncan Murdoch
>
>
>>
>>
>>
>>
>> install.packages( "devtools" )
>> devtools::install_github("ajdamico/lodown")
>> devtools::install_github("jimhester/archive")
>>
>>
>> file_folder <- file.path( tempdir() , "file_folder" )
>>
>> tf <- tempfile()
>>
>> # large download! cachaca saves on your local disk if already downloaded
>> lodown::cachaca(
>>   'http://download.inep.gov.br/microdados/microdados_enem2009.rar' ,
>>   tf , mode = 'wb' )
>>
>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>
>> unzipped_files <- list.files( file_folder , recursive = TRUE ,
>>   full.names = TRUE )
>>
>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>
>> # works
>> R.utils::countLines( infile )
>>
>> # works with warning
>> my_file <- readLines( infile , skipNul = TRUE )
>>
>> # crash
>> my_file <- readLines( infile )
>>
>>
>> # run just before crash
>> sessionInfo()
>> # R version 3.4.1 (2017-06-30)
>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>> # Running under: Windows 10 x64 (build 15063)
>>
>> # Matrix products: default
>>
>> # locale:
>> # [1] LC_COLLATE=English_United States.1252
>> # [2] LC_CTYPE=English_United States.1252
>> # [3] LC_MONETARY=English_United States.1252
>> # [4] LC_NUMERIC=C
>> # [5] LC_TIME=English_United States.1252
>>
>> # attached base packages:
>> # [1] stats graphics grDevices utils datasets methods base
>>
>> # loaded via a namespace (and not attached):
>> # [1] httr_1.2.1 compiler_3.4.1 R6_2.2.1 withr_1.0.2
>> # [5] tibble_1.3.3 curl_2.6 Rcpp_0.12.11 memoise_1.1.0
>> # [9] R.methodsS3_1.7.1 git2r_0.18.0 digest_0.6.12 lodown_0.1.0
>> # [13] R.utils_2.5.0 rlang_0.1.1 devtools_1.13.2 R.oo_1.21.0
>> # [17] archive_0.0.0.9000
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>