[R] R tools for large files
Prof Brian Ripley
ripley at stats.ox.ac.uk
Wed Aug 27 08:17:47 CEST 2003
I'm bored, but just to point out the obvious fact: to skip n lines in a
text file you have to read *all* the characters in between to find the
line separators.
I have known for 30 years that reading text files of numbers is slow and
inefficient. So do it only once and dump the results to a binary format,
an RDBMS, or ....
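
For example, a minimal sketch of that approach (the file names are only
illustrative, and the 41-column layout is the one used in the message
quoted below):

## one slow pass over the text, then a binary dump for all later sessions
m <- matrix(scan("BIG", what = integer(0)), ncol = 41, byrow = TRUE)
save(m, file = "BIG.rda")
## in later sessions:
load("BIG.rda")    # restores 'm' far faster than re-reading the text
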
On Wed, 27 Aug 2003, Richard A. O'Keefe wrote:
> Duncan Murdoch <dmurdoch at pair.com> wrote:
> > For example, if you want to read lines 1000 through 1100, you'd do it
> > like this:
> >
> > lines <- readLines("foo.txt", 1100)[1000:1100]
>
> I created a dataset thus:
> # file foo.awk:
> BEGIN {
> s = "01"
> for (i = 2; i <= 41; i++) s = sprintf("%s %02d", s, i)
> n = (27 * 1024 * 1024) / (length(s) + 1)
> for (i = 1; i <= n; i++) print s
> exit 0
> }
> # shell command:
> mawk -f foo.awk /dev/null >BIG
>
> That is, each record contains 41 2-digit integers, and the number
> of records was chosen so that the total size was approximately
> 27 megabytes. The number of records turns out to be 230,175.
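>
> (The same file could equally be generated from within R; an untested
> sketch:)
>
> s <- paste(formatC(1:41, width = 2, flag = "0"), collapse = " ")
> writeLines(rep(s, 230175), "BIG")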
>
> > system.time(v <- readLines("BIG"))
> [1] 7.75 0.17 8.13 0.00 0.00
> # With BIG already in the file system cache...
> > system.time(v <- readLines("BIG", 200000)[199001:200000])
> [1] 11.73 0.16 12.27 0.00 0.00
>
> What's the importance of this?
> First, experiments I shall not weary you with showed that the
> time to read N lines grows faster than N.
> Second, if you want to select the _last_ thousand lines,
> you have to read _all_ of them into memory.
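>
> (With plain readLines() that means holding the whole file; a chunked
> loop, an untested sketch, at least bounds the memory, though every
> character is still read:)
>
> con <- file("BIG", "r")
> last <- character(0)
> repeat {
>     chunk <- readLines(con, 10000)
>     if (length(chunk) == 0) break
>     last <- tail(c(last, chunk), 1000)    # keep only the last 1000 lines
> }
> close(con)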
>
> For real efficiency here, what's wanted is a variant of readLines
> where n is an index vector (a vector of non-negative integers,
> a vector of non-positive integers, or a vector of logicals) saying
> which lines should be kept.
>
> The function that would need changing is do_readLines() in
> src/main/connections.c; unfortunately, I don't understand R internals
> well enough to do it myself (yet).
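>
> With such an interface the extraction above might look like this
> (hypothetical, with n re-interpreted as an index vector; not implemented):
>
> v <- readLines("BIG", n = 199001:200000)
>
> or, in the non-positive form, readLines("BIG", n = -(1:199000)) for
> everything after the first 199000 lines.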
>
> As a matter of fact, that _still_ wouldn't yield real efficiency,
> because every character would still have to be read by the modified
> readLines(), and it reads characters using Rconn_fgetc(), which is
> what gives readLines() its power and utility, but certainly doesn't
> give it wings. (One of the fundamental laws of efficient I/O library
> design is to base it on block- or line-at-a-time transfers, not
> character-at-a-time.)
>
> The AWK program
> NR <= 199000 { next }
> {print}
> NR == 200000 { exit }
> extracts lines 199001:200000 in just 0.76 seconds, about 15 times
> faster. A C program to the same effect, using fgets(), took 0.39
> seconds, or about 30 times faster than R.
>
> There are two fairly clear sources of overhead in the R code:
> (1) the overhead of reading characters one at a time through Rconn_fgetc()
> instead of a block or line at a time. mawk doesn't use fgets() for
> reading, and _does_ have the overhead of repeatedly checking a
> regular expression to determine where the end of the line is,
> which it is sensible enough to fast-path.
> (2) the overhead of allocating, filling in, and keeping a whole lot of
> memory which is of no use whatever in computing the final result.
> mawk is actually fairly careful here, and only keeps one line at
> a time in the program shown above. Let's change it:
> NR <= 199000 {next}
> {a[NR] = $0}
> NR == 200000 {exit}
> END {for (i in a) print a[i]}
> That takes the time from 0.76 seconds to 0.80 seconds.
>
> The simplest thing that could possibly work would be to add a function
> skipLines(con, n) which simply reads and discards n lines.
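>
> Until then it can be emulated, still inefficiently but with bounded
> memory, in plain R (an untested sketch):
>
> skipLines <- function(con, n, chunk = 10000) {
>     while (n > 0) {
>         k <- min(n, chunk)
>         got <- length(readLines(con, k))
>         n <- n - got
>         if (got < k) break          # hit end of file early
>     }
>     invisible(n)                    # lines that could NOT be skipped
> }
>
> con <- file("BIG", "r")
> skipLines(con, 199000)
> v <- readLines(con, 1000)
> close(con)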
>
> result <- scan(textConnection(lines), list( .... ))
>
> > system.time(m <- scan(textConnection(v), integer(41)))
> Read 41000 items
> [1] 0.99 0.00 1.01 0.00 0.00
>
> One whole second to read 41,000 numbers on a 500 MHz machine?
>
> > vv <- rep(v, 240)
>
> Is there any possibility of storing the data in (platform) binary form?
> Binary connections (R-data.pdf, section 6.5 "Binary connections") can be
> used to read binary-encoded data.
>
> I wrote a little C program to save out the 230175 records of 41 integers
> each in native binary form. Then in R I did
>
> > system.time(m <- readBin("BIN", integer(), n=230175*41, size=4))
> [1] 0.57 0.52 1.11 0.00 0.00
> > system.time(m <- matrix(data=m, ncol=41, byrow=TRUE))
> [1] 2.55 0.34 2.95 0.00 0.00
>
> Remember, this doesn't read a *sample* of the data; it reads *all*
> the data. It is so much faster than the alternatives in R that it
> just isn't funny. Trying scan() on the file took nearly 10 minutes
> before I killed it the other day; using readBin() is a thousand times
> faster than a simple scan() call on this particular data set.
>
> There has *got* to be a way of either generating or saving the data
> in binary form, using only "approved" Windows tools. Heck, it can
> probably be done using VBA.
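>
> (If R itself can be used for the one-off conversion, writeBin() will
> produce the same native-binary file that readBin() reads above; a
> sketch, assuming the integers have already been got into the matrix m
> once, however slowly:)
>
> con <- file("BIN", "wb")
> writeBin(as.integer(t(m)), con, size = 4)   # row by row, matching byrow=TRUE
> close(con)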
>
>
> By the way, I've read most of the .pdf files I could find on the CRAN site,
> but haven't noticed any description of the R save-file format. Where should
> I have looked? (Yes, I know about src/main/saveload.c; I was hoping for
> some documentation, with maybe some diagrams.)
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595