[R] read in large data file (tsv) with inline filter?

Dirk Eddelbuettel edd at debian.org
Tue Mar 24 13:12:46 CET 2009

Hi David,

On 23 March 2009 at 15:09, Dylan Beaudette wrote:
| On Monday 23 March 2009, David Reiss wrote:
| > I have a very large tab-delimited file, too big to store in memory via
| > readLines() or read.delim(). Turns out I only need a few hundred of those
| > lines to be read in. If it were not so large, I could read the entire file
| > in and "grep" the lines I need. For such a large file, many calls to
| > read.delim() with incrementing "skip" and "nrows" parameters, followed by
| > grep() calls, are very slow. I am aware of possibilities via SQLite; I would
| > prefer to not use that in this case.
| >
| > My question is...Is there a function for efficiently reading in a file
| > along the lines of read.delim(), which allows me to specify a filter (via
| > grep or something else) that tells the function to only read in certain
| > lines that match?
| >
| > If not, I would *love* to see a "filter" parameter added as an option to
| > read.delim() and/or readLines().
|
| How about pre-filtering before loading the data into R:
|
| grep -E 'your pattern here' your_file_here > your_filtered_file
|
| Alternatively, if you need to search in fields, see 'awk' and 'cut', or if you
| need to delete things, see 'tr'.
|
| These tools come with any Unix-like OS, and you can probably get them on
| Windows without much effort.

Also note that read.delim() and friends all read from connections, and a shell
pipeline (a 'piped expression' in the Unix sense) can serve as such a
connection via pipe().
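
As a minimal sketch of the read.delim() case: the demo file, its column names
and the pattern 'abc' are made up for illustration, and grep is assumed to be
on the PATH (i.e. a Unix-like system):

```r
# Hypothetical demo data: a small tab-delimited file standing in for the big one.
tsv <- tempfile(fileext = ".tsv")
writeLines(c("id\tgene\tscore",
             "1\tabc1\t0.5",
             "2\txyz9\t0.7",
             "3\tabc2\t0.9"), tsv)

# Let grep do the filtering, so R only ever sees the matching lines.
# The header row does not match the pattern, so column names are given by hand.
cmd  <- sprintf("grep -E 'abc' %s", shQuote(tsv))
hits <- read.delim(pipe(cmd), header = FALSE,
                   col.names = c("id", "gene", "score"))
print(hits)
```

read.delim() opens the pipe() connection itself and closes it when it is done,
so no explicit open()/close() is needed here.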

That way you can build an ad-hoc filter by running readLines() over a pipe()
connection.  Consider this trivial example of grepping the section headers out
of the R FAQ.  Every header shows up twice because it appears once in the Table
of Contents and once as the actual section header:

R> readLines( pipe("awk '/^[0-9]+ / {print $1, $2, $3}' src/debian/R/R-alpha.20090320/doc/FAQ") )
 [1] "1 Introduction " "2 R Basics"      "3 R and"         "4 R Web"        
 [5] "5 R Add-On"      "6 R and"         "7 R Miscellanea" "8 R Programming"
 [9] "9 R Bugs"        "1 Introduction " "2 R Basics"      "3 R and"        
[13] "4 R Web"         "5 R Add-On"      "6 R and"         "7 R Miscellanea"
[17] "8 R Programming" "9 R Bugs"       

The regexp simply matches 'digits at the start of a line followed by a space',
which skips subsections like 1.1, 1.2, ... because of the dot after the first
digit.
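
If no Unix tools are at hand, roughly the same effect can be had in plain R by
reading from an open file() connection in chunks, since an open connection
keeps its position between readLines() calls.  This is only a sketch:
read_filtered() is a hypothetical helper, not a base R function, and the demo
file and pattern are made up:

```r
# Hypothetical helper (not part of base R): keep only lines matching `pattern`,
# reading `chunk` lines at a time so the whole file is never held in memory.
read_filtered <- function(path, pattern, chunk = 10000L) {
  con <- file(path, open = "rt")
  on.exit(close(con))
  keep <- character(0)
  repeat {
    lines <- readLines(con, n = chunk)   # continues where the last call stopped
    if (length(lines) == 0L) break       # end of file reached
    keep <- c(keep, grep(pattern, lines, value = TRUE))
  }
  keep
}

# Demo on a small made-up file, using a tiny chunk size to exercise the loop.
tsv <- tempfile(fileext = ".tsv")
writeLines(c("id\tgene", "1\tabc1", "2\txyz9", "3\tabc2"), tsv)
matched <- read_filtered(tsv, "abc", chunk = 2L)

# Parse the surviving lines as tab-delimited data.
hits <- read.delim(text = matched, header = FALSE, col.names = c("id", "gene"))
print(hits)
```

Growing `keep` with c() is fine when only a few hundred lines match, as in the
original question; for many matches, collecting the chunks in a list would be
the better choice.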

Hth, Dirk

Three out of two people have difficulties with fractions.
