[R] read in large data file (tsv) with inline filter?
Thomas Lumley
tlumley at u.washington.edu
Tue Mar 24 09:05:35 CET 2009
On Mon, 23 Mar 2009, David Reiss wrote:
> I have a very large tab-delimited file, too big to store in memory via
> readLines() or read.delim(). It turns out I only need a few hundred of those
> lines to be read in. If the file were not so large, I could read it all in
> and "grep" the lines I need. For such a large file, many calls to
> read.delim() with incrementing "skip" and "nrows" parameters, followed by
> grep() calls, are very slow.
You certainly don't want to use repeated reads from the start of the file with skip=, but if you set up a file connection
fileconnection <- file("my.tsv", open="r")
you can read from it incrementally with readLines() or read.delim() without going back to the start each time.
The speed of this approach should be within a reasonable constant factor of anything else, since reading the file once is unavoidable and should be the bottleneck.
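For example, a minimal sketch along these lines (assuming the wanted lines can be picked out with a regular expression; the file name, the pattern, and the chunk size are placeholders):

  con <- file("my.tsv", open = "r")
  header <- readLines(con, n = 1)        # keep the header line for parsing later
  wanted <- character(0)
  repeat {
      chunk <- readLines(con, n = 10000) # read the next 10000 lines; chunk size is arbitrary
      if (length(chunk) == 0) break      # end of file
      wanted <- c(wanted, grep("pattern", chunk, value = TRUE))  # keep matching lines
  }
  close(con)
  tc <- textConnection(c(header, wanted))
  df <- read.delim(tc)                   # parse the few hundred kept lines as a data frame
  close(tc)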
-thomas
Thomas Lumley
Assoc. Professor, Biostatistics
tlumley at u.washington.edu
University of Washington, Seattle