[R] R tools for large files

Liaw, Andy andy_liaw at merck.com
Tue Aug 26 04:15:46 CEST 2003


> From: Richard A. O'Keefe [mailto:ok at cs.otago.ac.nz] 
> 
> Murray Jorgensen <maj at stats.waikato.ac.nz> wrote:
> 	"Large" for my purposes means "more than I really want to read
> 	into memory", which in turn means "takes more than 30s".  I'm at
> 	home now and the file isn't, so I'm not sure whether the file
> 	is large or not.
> 	
> I repeat my earlier observation.  The AMOUNT OF DATA is 
> easily handled by a typical desktop machine these days.  The 
> problem is not the amount of data.  The problem is HOW LONG 
> IT TAKES TO READ.  I made several attempts to read the test 
> file I created yesterday, and each time gave up impatiently 
> after 5+ minutes elapsed time.  I tried again today (see 
> below) and went away to have a cup of tea &c; it took nearly 
> 10 minutes that time and still hadn't finished.  'mawk' read 
> _and processed_ the same file happily in under 30 seconds.
> 
> One quite serious alternative would be to write a little C 
> function to read the file into an array, and call that from R.
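> 
> A minimal sketch of the R side of that idea, assuming a small
> compiled C routine "read_dat" in readfile.so (both names are
> illustrative, not an existing library):
> 
>     ## Load the hypothetical shared library, call the C routine
>     ## (assumed to fill a double vector from the text file), and
>     ## reshape the result into a matrix.
>     dyn.load("readfile.so")
>     read.fast <- function(filename, nrow, ncol) {
>         ans <- .C("read_dat",
>                   as.character(filename),
>                   values = double(nrow * ncol),
>                   as.integer(nrow * ncol))
>         matrix(ans$values, nrow = nrow, ncol = ncol, byrow = TRUE)
>     }
>     m <- read.fast("m.txt", 250000, 41)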
> 
> > system.time(m <- matrix(1:(41*250000), nrow=250000, ncol=41))
> [1] 3.28 0.79 4.28 0.00 0.00
> > system.time(save(m, file="m.bin"))
> [1] 8.44 0.54 9.08 0.00 0.00
> > m <- NULL
> > system.time(load("m.bin"))
> [1] 11.25  0.19 11.51  0.00  0.00
> > length(m)
> [1] 10250000

I tried the following on my IBM T22 Thinkpad (P3-933 w/ 512MB):

> system.time(x <- matrix(runif(41*250000), 250000, 41))
[1] 6.02 0.40 6.52   NA   NA
> object.size(x)
[1] 82000120
> system.time(write(t(x), file="try.dat", ncol=41))
[1] 192.12  81.60 279.64     NA     NA
> system.time(xx <- matrix(scan("try.dat"), byrow=TRUE, ncol=41))
Read 10250000 items
[1] 110.90   1.09 126.89     NA     NA
> system.time(xx <- read.table("try.dat", header=FALSE,
+ colClasses=rep("numeric", 41)))
[1] 106.61   0.48 110.66     NA     NA
> system.time(save(x, file="try.rda"))
[1]  9.15  1.05 19.12    NA    NA
> rm(x)
> system.time(load("try.rda"))
[1] 10.22  0.33 10.69    NA    NA

The last few lines show that the timings I get are approximately the
same as yours, so the other timings shouldn't be too different.

I don't think I can make coffee that fast.  (No, I don't drink it black!)

Andy


> 
> The binary file m.bin is 41 million bytes.
> 
> This little transcript shows that a data set of this size can 
> be comfortably read from disc in under 12 seconds, on the 
> same machine where scan() took about 50 times as long before 
> I killed it.
> 
> So yet another alternative is to write a little program that 
> converts the data file to R binary format, and then just read 
> the whole thing in. I think readers will agree that 12 
> seconds on a 500MHz machine counts as "takes less than 30s".
> 
> 	It's just that R is so good at reading in initial
> 	segments of a file that I can't believe that it can't be
> 	effective in reading more general (pre-specified) subsets.
> 	
> R is *good* at it, it's just not *quick*.  Trying to select a 
> subset in scan() or read.table() wouldn't help all that much, 
> because it would still have to *scan* the data to determine 
> what to skip.
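> 
> For what it's worth, scan() does let you pre-specify a row range
> via its skip= and nlines= arguments.  A sketch, reusing the file
> from above:
> 
>     ## Rows 100001..110000 only.  The skipped lines still have to
>     ## be read from disc to find the newlines, so this saves some
>     ## parsing but little I/O, which is exactly the point above.
>     block <- matrix(scan("m.txt", skip = 100000, nlines = 10000),
>                     byrow = TRUE, ncol = 41)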
> 
> Two more times:
> An unoptimised C program writing 0:(41*250000-1) as a file of
> 41-number lines:
> f% time a.out >m.txt
> 13.0u 1.0s 0:14 94% 0+0k 0+0io 0pf+0w
> > system.time(m <- read.table("m.txt", header=FALSE))
> ^C
> Timing stopped at: 552.01 15.48 584.51 0 0 
> 
> To my eyes, src/main/scan.c shows no signs of having been 
> tuned for speed.  The goals appear to have been power (the R 
> scan() function has LOTS of options) and correctness, which 
> are perfectly good goals, and the speed of scan() and 
> read.table() with modest data sizes is quite good enough.
> 
> The huge ratio (>552)/(<30) for R/mawk does suggest that 
> there may be room for some serious improvement in scan(), 
> possibly by means of some extra hints about total size, 
> possibly by creating a fast path through the code.
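> 
> Even without touching the C code, scan() already accepts a couple
> of such hints: a typed what= avoids type guessing, and n= tells it
> how much to allocate up front.  A sketch, reusing m.txt:
> 
>     ## Pre-typed, pre-sized read; quiet=TRUE drops the item count.
>     v <- scan("m.txt", what = double(0), n = 41 * 250000,
>               quiet = TRUE)
>     m <- matrix(v, byrow = TRUE, ncol = 41)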
> 
> Of course the big point is that however long scan() takes to 
> read the data set, it only has to be done once.  Leave R 
> running overnight and in the morning save the dataset out as 
> an R binary file using save(). Then you'll be able to load it 
> again quickly.
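> 
> In code, that convert-once idiom is just (file names as above):
> 
>     ## Pay the parsing cost once, overnight if need be ...
>     m <- matrix(scan("m.txt"), byrow = TRUE, ncol = 41)
>     save(m, file = "m.rda")    # R binary format
>     ## ... then every later session starts with a fast load().
>     load("m.rda")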
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list 
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> 
