[R] read in large data file (tsv) with inline filter?
tlumley at u.washington.edu
Tue Mar 24 09:05:35 CET 2009
On Mon, 23 Mar 2009, David Reiss wrote:
> I have a very large tab-delimited file, too big to store in memory via
> readLines() or read.delim(). Turns out I only need a few hundred of those
> lines to be read in. If it were not so large, I could read the entire file
> in and "grep" the lines I need. For such a large file, many calls to
> read.delim() with incrementing "skip" and "nrows" parameters, followed by
> grep() calls, are very slow.
You certainly don't want to use repeated reads from the start of the file with skip=, but if you set up a file connection
fileconnection <- file("my.tsv", open="r")
you can read from it incrementally with readLines() or read.delim() without going back to the start each time.
The speed of this approach should be within a reasonable constant factor of anything else, since reading the file once is unavoidable and should be the bottleneck.
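As a rough sketch of the idea (not a tested recipe for your particular file), something like the following reads the file in chunks from one open connection and keeps only the matching lines. The names filter_lines, pattern and chunk_size are made up for illustration, and I am assuming the file has a single header line and that a regular expression applied to the raw line is enough to pick out the rows you want.

```r
# Hypothetical helper: scan a large tab-delimited file once and keep only
# the lines matching `pattern`. Assumes the first line is a header.
filter_lines <- function(path, pattern, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))                      # always release the connection

  header <- readLines(con, n = 1L)         # assumed header line
  kept <- list()

  repeat {
    # Each call continues from where the previous one stopped.
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break         # end of file
    kept[[length(kept) + 1L]] <- chunk[grepl(pattern, chunk)]
  }

  # Parse only the header plus the few matching lines.
  read.delim(text = c(header, unlist(kept)), stringsAsFactors = FALSE)
}
```

Because the connection stays open, only one chunk is in memory at a time, and the full parse by read.delim() is applied only to the lines that survive the filter. A call might look like filter_lines("my.tsv", "some_id"), where "some_id" stands in for whatever marks the lines you need.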
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle