[R] reading very large files
Marc Schwartz
marc_schwartz at comcast.net
Fri Feb 2 20:04:04 CET 2007
On Fri, 2007-02-02 at 12:42 -0600, Marc Schwartz wrote:
> On Fri, 2007-02-02 at 12:32 -0600, Marc Schwartz wrote:
> > Juli,
> >
> > I don't have a file to test this on, so caveat emptor.
> >
> > The problem with the approach above is that you are re-reading the
> > source file once per line, or 3000 times. In addition, each read
> > likely has to pass through half of the file, on average, to locate
> > the randomly selected line. Thus, in reality you are probably reading
> > on the order of:
> >
> > > 3000 * 450000
> > [1] 1.35e+09
> >
> > lines from the file, which of course is going to be quite slow.
> >
> > You are also writing to the target file 3000 times.
> >
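> > To make the cost concrete, here is a rough sketch of what such a
> > per-line approach presumably looks like (the original code is not
> > quoted in this message, so the exact form is an assumption):
> >
> > sel <- sample(900000, 3000)
> >
> > for (i in sel)
> > {
> >     # Each scan() call has to read past all of the skipped lines
> >     # again, just to return the single selected row
> >     Line <- scan("myfile", what = character(), sep = "\n",
> >                  skip = i - 1, nlines = 1)
> >     write(Line, "myfile_short", append = TRUE)
> > }
> >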
> > The basic premise of the approach below is that you are, in effect,
> > creating a sequential file cache in an R object: read a large chunk
> > of the source file into the cache, pick out any randomly selected
> > rows that fall within it, and then write those rows out to the
> > target file.
> >
> > Thus, if you can read 100,000 rows at once, you would have 9 reads of
> > the source file and 9 writes of the target file.
> >
> > The key thing here is to ensure that the offsets within the cache and
> > the corresponding random row values are properly aligned. For example,
> > if row 250,000 was sampled and the current chunk was read after
> > skipping 200,000 rows, that row is element 250,000 - 200,000 = 50,000
> > within the chunk.
> >
> > Here's the code:
> >
> > # Generate the random values
> > sel <- sample(1:900000, 3000)
> >
> > # Set up a sequence for the cache chunks
> > # Presume you can read 100,000 rows at once
> > Cuts <- seq(0, 900000, 100000)
> >
> > # Loop over the length of Cuts, less 1
> > for (i in seq(along = Cuts[-1]))
> > {
> >     # Get a 100,000 row chunk, skipping rows
> >     # as appropriate for each subsequent chunk
> >     Chunk <- scan("myfile", what = character(), sep = "\n",
> >                   skip = Cuts[i], nlines = 100000)
> >
> >     # Set up a row sequence for the current chunk
> >     Rows <- (Cuts[i] + 1):(Cuts[i + 1])
> >
> >     # Are any of the random values in the current chunk?
> >     Chunk.Sel <- sel[which(sel %in% Rows)]
> >
> >     # If so, get them
> >     if (length(Chunk.Sel) > 0)
> >     {
> >         Write.Rows <- Chunk[sel - Cuts[i]]
>
>
> Quick typo correction:
>
> The last line above should be:
>
> Write.Rows <- Chunk[sel - Cuts[i], ]
>
>
> > # Now write them out
> > write(Write.Rows, "myfile_short", append = TRUE)
> > }
> > }
> >
OK, I knew it was too good to be true...
One more correction on that same line. Since Chunk is a character vector
returned by scan(), it takes a single subscript rather than a
matrix-style index:
Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]
For clarity, here is the full set of code:
# Generate the random values
sel <- sample(900000, 3000)

# Set up a sequence for the cache chunks
# Presume you can read 100,000 rows at once
Cuts <- seq(0, 900000, 100000)

# Loop over the length of Cuts, less 1
for (i in seq(along = Cuts[-1]))
{
    # Get a 100,000 row chunk, skipping rows
    # as appropriate for each subsequent chunk
    Chunk <- scan("myfile", what = character(), sep = "\n",
                  skip = Cuts[i], nlines = 100000)

    # Set up a row sequence for the current chunk
    Rows <- (Cuts[i] + 1):(Cuts[i + 1])

    # Are any of the random values in the current chunk?
    Chunk.Sel <- sel[which(sel %in% Rows)]

    # If so, get them
    if (length(Chunk.Sel) > 0)
    {
        Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]

        # Now write them out
        write(Write.Rows, "myfile_short", append = TRUE)
    }
}
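As a quick sanity check, myfile_short should end up with exactly 3000
lines, presuming that it did not already exist before the loop was run
(the write() calls append to it):

# Count the lines actually written to the target file
length(readLines("myfile_short"))
# should return 3000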
Regards,
Marc