[R] R tools for large files
Prof Brian Ripley
ripley at stats.ox.ac.uk
Wed Aug 27 08:17:47 CEST 2003
I'm bored, but just to point out the obvious fact: to skip n lines in a
text file you have to read *all* the characters in between to find the
line separators.
I have known for 30 years that reading text files of numbers is slow and
inefficient. So do it only once and dump the results to a binary format,
an RDBMS, or ....
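
For example, a minimal sketch of that approach (the file names are only
illustrative, and the 41-column layout is the one used in the message
quoted below):

## one slow pass over the text, then a binary dump for all later sessions
m <- matrix(scan("BIG", what = integer(0)), ncol = 41, byrow = TRUE)
save(m, file = "BIG.rda")
## in later sessions:
load("BIG.rda")    # restores 'm' far faster than re-reading the text
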
On Wed, 27 Aug 2003, Richard A. O'Keefe wrote:
> Duncan Murdoch <dmurdoch at pair.com> wrote:
> > For example, if you want to read lines 1000 through 1100, you'd do it
> > like this:
> >
> > lines <- readLines("foo.txt", 1100)[1000:1100]
>
> I created a dataset thus:
> # file foo.awk:
> BEGIN {
> s = "01"
> for (i = 2; i <= 41; i++) s = sprintf("%s %02d", s, i)
> n = (27 * 1024 * 1024) / (length(s) + 1)
> for (i = 1; i <= n; i++) print s
> exit 0
> }
> # shell command:
> mawk -f foo.awk /dev/null >BIG
>
> That is, each record contains 41 2-digit integers, and the number
> of records was chosen so that the total size was approximately
> 27 megabytes. The number of records turns out to be 230,175.
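>
> (The same file could equally be generated from within R; an untested
> sketch:)
>
> s <- paste(formatC(1:41, width = 2, flag = "0"), collapse = " ")
> writeLines(rep(s, 230175), "BIG")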
>
> > system.time(v <- readLines("BIG"))
> [1] 7.75 0.17 8.13 0.00 0.00
> # With BIG already in the file system cache...
> > system.time(v <- readLines("BIG", 200000)[199001:200000])
> [1] 11.73 0.16 12.27 0.00 0.00
>
> What's the importance of this?
> First, experiments I shall not weary you with showed that the
> time to read N lines grows faster than N.
> Second, if you want to select the _last_ thousand lines,
> you have to read _all_ of them into memory.
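>
> (With plain readLines() that means holding the whole file; a chunked
> loop, an untested sketch, at least bounds the memory, though every
> character is still read:)
>
> con <- file("BIG", "r")
> last <- character(0)
> repeat {
>     chunk <- readLines(con, 10000)
>     if (length(chunk) == 0) break
>     last <- tail(c(last, chunk), 1000)    # keep only the last 1000 lines
> }
> close(con)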
>
> For real efficiency here, what's wanted is a variant of readLines
> where n is an index vector (a vector of non-negative integers,
> a vector of non-positive integers, or a vector of logicals) saying
> which lines should be kept.
>
> The function that would need changing is do_readLines() in
> src/main/connections.c; unfortunately, I don't understand R internals
> well enough to do it myself (yet).
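>
> With such an interface the extraction above might look like this
> (hypothetical, with n re-interpreted as an index vector; not implemented):
>
> v <- readLines("BIG", n = 199001:200000)
>
> or, in the non-positive form, readLines("BIG", n = -(1:199000)) for
> everything after the first 199000 lines.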
>
> As a matter of fact, that _still_ wouldn't yield real efficiency,
> because every character would still have to be read by the modified
> readLines(), and it reads characters using Rconn_fgetc(), which is
> what gives readLines() its power and utility, but certainly doesn't
> give it wings. (One of the fundamental laws of efficient I/O library
> design is to base it on block- or line-at-a-time transfers, not
> character-at-a-time.)
>
> The AWK program
> NR <= 199000 { next }
> {print}
> NR == 200000 { exit }
> extracts lines 199001:200000 in just 0.76 seconds, about 15 times
> faster. A C program to the same effect, using fgets(), took 0.39
> seconds, or about 30 times faster than R.
>
> There are two fairly clear sources of overhead in the R code:
> (1) the overhead of reading characters one at a time through Rconn_fgetc()
> instead of a block or line at a time. mawk doesn't use fgets() for
> reading, and _does_ have the overhead of repeatedly checking a
> regular expression to determine where the end of the line is,
> which it is sensible enough to fast-path.
> (2) the overhead of allocating, filling in, and keeping a whole lot of
> memory which is of no use whatever in computing the final result.
> mawk is actually fairly careful here, and only keeps one line at
> a time in the program shown above. Let's change it:
> NR <= 199000 {next}
> {a[NR] = $0}
> NR == 200000 {exit}
> END {for (i in a) print a[i]}
> That takes the time from 0.76 seconds to 0.80 seconds.
>
> The simplest thing that could possibly work would be to add a function
> skipLines(con, n) which simply reads and discards n lines.
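>
> Until then it can be emulated, still inefficiently but with bounded
> memory, in plain R (an untested sketch):
>
> skipLines <- function(con, n, chunk = 10000) {
>     while (n > 0) {
>         k <- min(n, chunk)
>         got <- length(readLines(con, k))
>         n <- n - got
>         if (got < k) break          # hit end of file early
>     }
>     invisible(n)                    # lines that could NOT be skipped
> }
>
> con <- file("BIG", "r")
> skipLines(con, 199000)
> v <- readLines(con, 1000)
> close(con)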
>
> result <- scan(textConnection(lines), list( .... ))
>
> > system.time(m <- scan(textConnection(v), integer(41)))
> Read 41000 items
> [1] 0.99 0.00 1.01 0.00 0.00
>
> One whole second to read 41,000 numbers on a 500 MHz machine?
>
> > vv <- rep(v, 240)
>
> Is there any possibility of storing the data in (platform) binary form?
> Binary connections (R-data.pdf, section 6.5 "Binary connections") can be
> used to read binary-encoded data.
>
> I wrote a little C program to save out the 230175 records of 41 integers
> each in native binary form. Then in R I did
>
> > system.time(m <- readBin("BIN", integer(), n=230175*41, size=4))
> [1] 0.57 0.52 1.11 0.00 0.00
> > system.time(m <- matrix(data=m, ncol=41, byrow=TRUE))
> [1] 2.55 0.34 2.95 0.00 0.00
>
> Remember, this doesn't read a *sample* of the data; it reads *all*
> the data. It is so much faster than the alternatives in R that it
> just isn't funny. Trying scan() on the file took nearly 10 minutes
> before I killed it the other day; using readBin() is a thousand times
> faster than a simple scan() call on this particular data set.
>
> There has *got* to be a way of either generating or saving the data
> in binary form, using only "approved" Windows tools. Heck, it can
> probably be done using VBA.
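>
> (If R itself can be used for the one-off conversion, writeBin() will
> produce the same native-binary file that readBin() reads above; a
> sketch, assuming the integers have already been got into the matrix m
> once, however slowly:)
>
> con <- file("BIN", "wb")
> writeBin(as.integer(t(m)), con, size = 4)   # row by row, matching byrow=TRUE
> close(con)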
>
>
> By the way, I've read most of the .pdf files I could find on the CRAN site,
> but haven't noticed any description of the R save-file format. Where should
> I have looked? (Yes, I know about src/main/saveload.c; I was hoping for
> some documentation, with maybe some diagrams.)
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595