[R] R tools for large files

Adaikalavan Ramasamy gisar at nus.edu.sg
Wed Aug 27 05:37:15 CEST 2003


If we are going to use unix tools to create a new dataset before calling
into R, why not simply use 

tail +1001 my_big_bad_file | head -100

to read file lines 1001-1100, i.e. data rows 1000-1099 (assuming one
header row).
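
The same pipeline can feed R directly through a pipe connection, so
the subset never touches an intermediate file (a sketch; the file name
is made up, and read.table's defaults are assumed to suit the data):

df <- read.table(pipe("tail +1001 my_big_bad_file | head -100"))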

Or if you have the shortlisted rownames in one file, you can use join
after sort. A working example follows.

########################################################################

#!/bin/bash

# match.sh -- last modified 10/07/03
# Does the same thing as egrep 'a|b|c|...' file, but in batch mode.
# A script that matches all occurrences of <shortlist> in <data>, using
# the first column as the common key.

if [ $# -ne 2 ]; then
   echo "Usage: ${0##*/} <shortlist> <data>"
   exit 1
fi

TEMP1=/tmp/temp1.`date "+%y%m%d-%H%M%S"`
TEMP2=/tmp/temp2.`date "+%y%m%d-%H%M%S"`
TEMP3=/tmp/temp3.`date "+%y%m%d-%H%M%S"`
TEMP4=/tmp/temp4.`date "+%y%m%d-%H%M%S"`
TEMP5=/tmp/temp5.`date "+%y%m%d-%H%M%S"`

# Number each shortlist line (so the original order can be restored
# later), then sort on the key column for join.
grep -n . $1 | cut -f1 -d: | paste - $1 > $TEMP1
sort -k 2 $TEMP1 > $TEMP2

tail +2 $2 | sort -k 1 > $TEMP3  # assumes the data file has a header row

headerRow=`head -1 $2`

join -1 2 -2 1 -a 1 -t$'\t' $TEMP2 $TEMP3 > $TEMP4  # tab-delimited join
sort -n -k 2 $TEMP4 > $TEMP5     # restore the original shortlist order

/bin/echo "$headerRow"
cut -f1,3- $TEMP5                # column 2 contains orderings

rm $TEMP1 $TEMP2 $TEMP3 $TEMP4 $TEMP5

########################################################################
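
The script's output can be read straight into R the same way, through
a pipe connection (file names here are made up, and sep = "\t" assumes
the data file is tab-delimited):

subset <- read.table(pipe("./match.sh shortlist.txt data.txt"),
                     header = TRUE, sep = "\t")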



-----Original Message-----
From: Richard A. O'Keefe [mailto:ok at cs.otago.ac.nz] 
Sent: Wednesday, August 27, 2003 9:04 AM
To: r-help at stat.math.ethz.ch
Subject: Re: [R] R tools for large files


Duncan Murdoch <dmurdoch at pair.com> wrote:
	For example, if you want to read lines 1000 through 1100, you'd
	do it like this:
	
	 lines <- readLines("foo.txt", 1100)[1000:1100]

I created a dataset thus:
# file foo.awk:
BEGIN {
    s = "01"
    for (i = 2; i <= 41; i++) s = sprintf("%s %02d", s, i)
    n = (27 * 1024 * 1024) / (length(s) + 1)
    for (i = 1; i <= n; i++) print s
    exit 0
}
# shell command:
mawk -f foo.awk /dev/null >BIG

That is, each record contains 41 two-digit integers, and the number of
records was chosen so that the total size was approximately 27
megabytes.  The number of records turns out to be 230,175.
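
For reference, roughly the same test file can be generated from R
itself (a sketch, not how the file above was actually built):

s <- paste(sprintf("%02d", 1:41), collapse = " ")   # "01 02 ... 41"
n <- (27 * 1024 * 1024) %/% (nchar(s) + 1)          # 230,175 records
writeLines(rep(s, n), "BIG")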

> system.time(v <- readLines("BIG"))
[1] 7.75 0.17 8.13 0.00 0.00
	# With BIG already in the file system cache...
> system.time(v <- readLines("BIG", 200000)[199001:200000])
[1] 11.73  0.16 12.27  0.00  0.00

What's the importance of this?
First, experiments I shall not weary you with showed that the time to
read N lines grows faster than N.  Second, if you want to select the
_last_ thousand lines, you have to read the whole file into memory.

For real efficiency here, what's wanted is a variant of readLines where
n is an index vector (a vector of non-negative integers, a vector of
non-positive integers, or a vector of logicals) saying which lines
should be kept.
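
Pending such a change, the idea can at least be approximated in pure R
by reading in chunks and keeping only the wanted lines, so that at most
one chunk of unwanted lines is held at a time (a sketch; the function
name and chunk size are my own):

readSomeLines <- function(path, keep, chunk = 10000) {
    # keep: sorted vector of positive line numbers to retain
    con <- file(path, "r")
    on.exit(close(con))
    out  <- character(0)
    base <- 0                                 # lines consumed so far
    while (base < max(keep)) {
        block <- readLines(con, chunk)
        if (length(block) == 0) break         # hit end of file
        want <- keep[keep > base & keep <= base + length(block)]
        out  <- c(out, block[want - base])
        base <- base + length(block)
    }
    out
}

v <- readSomeLines("BIG", 199001:200000)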

The function that would need changing is do_readLines() in
src/main/connections.c; unfortunately, I don't understand R internals
well enough to do it myself (yet).

As a matter of fact, that _still_ wouldn't yield real efficiency,
because every character would still have to be read by the modified
readLines(), and it reads characters using Rconn_fgetc(), which is what
gives readLines() its power and utility, but certainly doesn't give it
wings.  (One of the fundamental laws of efficient I/O library design is
to base it on block- or line- at-a-time transfers, not
character-at-a-time.)

The AWK program
    NR <= 199000 { next }
    {print}
    NR == 200000 { exit }
extracts lines 199001:200000 in just 0.76 seconds, about 15 times faster.
A C program to the same effect, using fgets(), took 0.39 seconds, or
about 30 times faster than R.

There are two fairly clear sources of overhead in the R code:
(1) the overhead of reading characters one at a time through
    Rconn_fgetc() instead of a block or line at a time.  mawk doesn't
    use fgets() for reading, and _does_ have the overhead of repeatedly
    checking a regular expression to determine where the end of the
    line is, which it is sensible enough to fast-path.
(2) the overhead of allocating, filling in, and keeping a whole lot of
    memory which is of no use whatever in computing the final result.
    mawk is actually fairly careful here, and only keeps one line at
    a time in the program shown above.  Let's change it:
	NR <= 199000 {next}
	{a[NR] = $0}
	NR == 200000 {exit}
	END {for (i in a) print a[i]}
    That takes the time from 0.76 seconds to 0.80 seconds.

The simplest thing that could possibly work would be to add a function
skipLines(con, n) which simply read and discarded n lines.
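
A throwaway version can even be written in R today, discarding
fixed-size chunks so that the skipped lines are never all held in
memory at once (the chunk size is my own choice):

skipLines <- function(con, n, chunk = 10000) {
    while (n > 0) {
        got <- length(readLines(con, min(n, chunk)))
        if (got == 0) break                   # ran out of input early
        n <- n - got
    }
    invisible(NULL)
}

con <- file("BIG", "r")
skipLines(con, 199000)
v <- readLines(con, 1000)                     # lines 199001:200000
close(con)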

	 result <- scan(textConnection(lines), list( .... ))
	
> system.time(m <- scan(textConnection(v), integer(41)))
Read 41000 items
[1] 0.99 0.00 1.01 0.00 0.00

One whole second to read 41,000 numbers on a 500 MHz machine?

> vv <- rep(v, 240)

Is there any possibility of storing the data in (platform) binary form?
Binary connections (R-data.pdf, section 6.5 "Binary connections") can be
used to read binary-encoded data.

I wrote a little C program to save out the 230175 records of 41 integers
each in native binary form.  Then in R I did

> system.time(m <- readBin("BIN", integer(), n=230175*41, size=4))
[1] 0.57 0.52 1.11 0.00 0.00
> system.time(m <- matrix(data=m, ncol=41, byrow=TRUE))
[1] 2.55 0.34 2.95 0.00 0.00

Remember, this doesn't read a *sample* of the data, it reads *all* the
data.  It is so much faster than the alternatives in R that it just
isn't funny.  Trying scan() on the file took nearly 10 minutes before I
killed it the other day; using readBin() is a thousand times faster than
a simple scan() call on this particular data set.

There has *got* to be a way of either generating or saving the data in
binary form, using only "approved" Windows tools.  Heck, it can probably
be done using VBA.
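
For what it's worth, once the data have made it into R a single time,
R can write the binary file itself; a sketch reusing the matrix m from
above:

con <- file("BIN", "wb")
writeBin(as.integer(t(m)), con, size = 4)  # t() gives row-major order,
close(con)                                 # matching byrow=TRUE on read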


By the way, I've read most of the .pdf files I could find on the CRAN
site, but haven't noticed any description of the R save-file format.
Where should I have looked?  (Yes, I know about src/main/saveload.c; I
was hoping for some documentation, with maybe some diagrams.)

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
