[R] reading very large files

Marc Schwartz marc_schwartz at comcast.net
Fri Feb 2 19:32:15 CET 2007


On Fri, 2007-02-02 at 18:40 +0100, juli g. pausas wrote:
> Hi all,
> I have a large file (1.8 GB) with 900,000 lines that I would like to read.
> Each line is a string of characters. Specifically, I would like to randomly
> select 3000 lines. For smaller files, what I'm doing is:
> 
> trs <- scan("myfile", what= character(), sep = "\n")
> trs<- trs[sample(length(trs), 3000)]
> 
> And this works OK; however, my computer does not seem able to handle the
> 1.8 GB file.
> I thought of an alternative way that does not require reading the whole file:
> 
> sel <- sample(1:900000, 3000)
> for (i in 1:3000)  {
> un <- scan("myfile", what= character(), sep = "\n", skip=sel[i], nlines=1)
> write(un, "myfile_short", append=TRUE)
> }
> 
> This works on my computer; however, it is extremely slow, since it reads one
> line at a time. It has been running for 25 hours and I think it has done less
> than half of the file (yes, probably I do not have a very good computer and
> I'm working under Windows ...).
> So my question is: do you know any other faster way to do this?
> Thanks in advance
> 
> Juli


Juli,

I don't have a file to test this on, so caveat emptor.

The problem with the approach above is that you are re-reading the
source file once per line, or 3000 times.  In addition, each read
likely scans through half the file on average to locate the randomly
selected line. Thus, the reality is that you are probably reading on the
order of:

> 3000 * 450000
[1] 1.35e+09

lines from the file, which of course is going to be quite slow.

In addition, you are also writing to the target file 3000 times.

The basic premise of the approach below is that you are in effect
creating a sequential file cache in an R object: read a large chunk of
the source file into the cache, select whichever randomly chosen rows
fall within that cache, and then write out the selected rows.

Thus, if you can read 100,000 rows at once, you would have 9 reads of
the source file and at most 9 writes of the target file.

The key thing here is to ensure that the offsets within the cache and
the corresponding random row values are properly set.

Here's the code:

# Generate the random values
sel <- sample(1:900000, 3000)

# Set up a sequence for the cache chunks
# Presume you can read 100,000 rows at once
Cuts <- seq(0, 900000, 100000)

# Loop over the length of Cuts, less 1
for (i in seq(along = Cuts[-1]))
{
  # Get a 100,000 row chunk, skipping rows
  # as appropriate for each subsequent chunk
  Chunk <- scan("myfile", what = character(), sep = "\n", 
                 skip = Cuts[i], nlines = 100000)

  # set up a row sequence for the current 
  # chunk
  Rows <- (Cuts[i] + 1):(Cuts[i + 1])

  # Are any of the random values in the 
  # current chunk?
  Chunk.Sel <- sel[which(sel %in% Rows)]

  # If so, get them 
  if (length(Chunk.Sel) > 0)
  {
    # Index into the chunk using offsets relative to the rows skipped
    Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]

    # Now write them out
    write(Write.Rows, "myfile_short", append = TRUE)
  }
}
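
Once the loop finishes, a quick sanity check (untested as well) is to
read the output file back and confirm that one line was written for
each sampled row:

# Read the result back; length() should be 3000
Short <- scan("myfile_short", what = character(), sep = "\n")
length(Short)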


As noted, I have not tested this, and there may yet be additional ways
to save time with file seeks, etc.
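
One such possibility (also untested) is to keep a single connection
open for the whole job, so that each chunk read continues where the
previous one stopped, rather than re-skipping all the earlier rows on
every pass. This is only a sketch, assuming the same 'sel' and 'Cuts'
as above:

# Open the file once; readLines() on an open connection advances
# through the file, so no skip argument is needed
con <- file("myfile", open = "r")

for (i in seq(along = Cuts[-1]))
{
  # Read the next 100,000 row chunk from the connection
  Chunk <- readLines(con, n = 100000)

  # Row numbers covered by the current chunk
  Rows <- (Cuts[i] + 1):(Cuts[i + 1])

  # Random values falling in the current chunk, if any
  Chunk.Sel <- sel[which(sel %in% Rows)]

  if (length(Chunk.Sel) > 0)
  {
    Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]
    write(Write.Rows, "myfile_short", append = TRUE)
  }
}

close(con)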

HTH,

Marc Schwartz


