[R] scan() vs readChar() speed

baptiste auguie baptiste.auguie at googlemail.com
Mon Apr 2 05:59:42 CEST 2012


Thanks; I did not notice an appreciable difference between scan() and
scan(what=double()) in this example.
Adding to my confusion, I noticed a strange and apparently systematic
discrepancy between the timing results when the code is run within
R.app, within emacs, or from a terminal. Any idea what might be
causing this?
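
One possible reason for the lack of difference: the `what` argument of
scan() already defaults to double(), so the two calls request the same
parsing. A minimal sketch to check this, assuming a smaller file than in
the original test (the temporary file and the 200 x 500 size are
illustrative choices, not from the thread). The read.table() call with
colClasses is included only as the case where declaring column types
could plausibly matter; that reading of the advice is an assumption.

```r
## Hypothetical, reduced-size version of the test file (200 x 500 values)
f <- tempfile(fileext = ".txt")
m <- matrix(round(rnorm(200 * 500), 3), nrow = 200)
write.table(m, file = f, row.names = FALSE, col.names = FALSE)

## scan() with its default `what` (which is double())
t_default <- system.time(a <- scan(f, quiet = TRUE))

## scan() with the type stated explicitly; same request as above
t_typed <- system.time(b <- scan(f, what = double(), quiet = TRUE))

## read.table() without and with declared column types
t_rt_guess <- system.time(d1 <- read.table(f))
t_rt_typed <- system.time(d2 <- read.table(f, colClasses = "numeric"))

## Both scan() calls return the same numeric vector
identical(a, b)

## Compare elapsed times; values depend on machine and front end
rbind(scan_default = t_default, scan_typed = t_typed,
      read.table_guess = t_rt_guess, read.table_typed = t_rt_typed)[, 1:3]
```

Timings are deliberately not shown, since they vary between machines and
between R.app, emacs, and a terminal.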

Thanks,

baptiste

On 2 April 2012 11:04, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> On 12-04-01 2:58 AM, baptiste auguie wrote:
>>
>> Dear list,
>>
>> I am trying to find a fast solution to read moderately large (1 -- 10
>> million entries) text files containing only tab-delimited numeric
>> values. My test file is the following,
>>
>> nr <- 1000
>> nc <- 5000
>>
>> m <- matrix(round(rnorm(nr*nc),3), nrow=nr)
>> write.table(m, file = "a.txt", append=FALSE,
>>             row.names = FALSE, col.names = FALSE)
>>
>>
>> scan() is faster than read.table(), as expected, but still quite slow
>> compared to Matlab for example. Based on archived discussions on this
>> list and Stack Overflow, I tried readChar(); it's really fast.
>> However, it returns a long character string, where I really want
>> numeric values. I can use as.numeric(strsplit()), but to my complete
>> surprise it is faster to run scan() on this text string. Consider the
>> following comparison (I use the command line wc to optimize the memory
>> allocation),
>
>
> Tell it the types of the columns, and it will go a bit faster.
>
> Duncan Murdoch
>
>>
>> load_file1 <- function(f){
>>   ## ask wc the number of words
>>   n <- scan(textConnection(system(paste("wc -w ", f), intern=TRUE)),
>>             what=list(integer(), character()), quiet=TRUE)[[1]]
>>   all <- scan(f, nmax=n, quiet=TRUE)
>>   invisible(all)
>> }
>>
>> load_file2 <- function(f){
>>   ## ask wc the number of characters
>>   n <- scan(textConnection(system(paste("wc -m ", f), intern=TRUE)),
>>             what=list(integer(), character()), quiet=TRUE)[[1]]
>>   tc <- textConnection(readChar(f, n))
>>   all <- scan(tc, quiet=TRUE, multi.line = FALSE)
>>   close(tc)
>>   invisible(all)
>> }
>>
>>
>> system.time(a <- load_file1("a.txt"))
>>  ## user  system elapsed
>>  ##  7.805   0.138   8.026
>> system.time(b <- load_file2("a.txt"))
>>  ## user  system elapsed
>>  ##  2.182   0.301   2.538
>> all.equal(a, b)
>> ## [1] TRUE
>>
>>
>> Could someone explain to me why it is faster to scan a textConnection
>> than the original file? Have I missed a better solution?
>>
>> Thanks,
>>
>> baptiste
>>
>> sessionInfo()
>> R version 2.15.0 RC (2012-03-29 r58868)
>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>
>
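
The quoted message compares two ways of turning the string returned by
readChar() into numbers: as.numeric(strsplit()) and scan() on a
textConnection. A minimal sketch of both routes follows, under stated
assumptions: the file is a reduced 200 x 500 version of the test file,
and file.info()$size is swapped in for the `wc -m` call so the example
does not depend on an external command (byte count equals character
count only because the file is plain ASCII).

```r
## Hypothetical, reduced-size version of the test file
f <- tempfile(fileext = ".txt")
m <- matrix(round(rnorm(200 * 500), 3), nrow = 200)
write.table(m, file = f, row.names = FALSE, col.names = FALSE)

## Read the whole file as one string; size in bytes stands in for wc -m
n <- file.info(f)$size
txt <- readChar(f, nchars = n, useBytes = TRUE)

## Route 1: split on runs of whitespace, then convert
t_split <- system.time(
  x1 <- as.numeric(strsplit(txt, "[ \t\r\n]+")[[1]])
)

## Route 2: let scan() parse the string through a textConnection
t_scan <- system.time({
  tc <- textConnection(txt)
  x2 <- scan(tc, quiet = TRUE)
  close(tc)
})

## Both routes should give the same values
all.equal(x1, x2)

## Compare elapsed times; results depend on machine and R version
rbind(strsplit = t_split, scan_text = t_scan)[, 1:3]
```

Which route is faster may differ from the timings reported above,
because the file here is much smaller and the R version may not match.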