[R] Scanning grep through huge files

Duncan Murdoch murdoch at stats.uwo.ca
Tue Nov 3 15:51:28 CET 2009


On 11/3/2009 9:29 AM, Johannes Graumann wrote:
> Hi,
> 
> I'm dealing with huge files I would like to index. On a Linux system "grep 
> -buo <PATTERN> <FILENAME>" hands me the byte offsets for "PATTERN" very 
> quickly and I am looking to emulate that speed and ease with native R tools 
> - for portability and elegance. "gregexpr" should be able to do that but I 
> fail to combine it with "scan" or an equivalent to parse the whole file 
> without having to read it all into memory.

I think you are going to have to write this yourself.  R doesn't have 
very many stream-oriented functions:  almost everything assumes the 
whole object is in memory.

You will also have trouble with the byte offsets.  The semantics of the 
-u option to grep are quite strange (at least according to the man page 
on Cygwin).

What I'd do given your problem is use readLines to read the file, then 
post-process the result of gregexpr to give line and byte offset pairs 
for each match; those are more useful in R than the rather bizarre "byte 
offsets" that grep -buo will give.  But for a huge file you'll probably 
have to do this in blocks, as the whole file may be too big to hold in 
memory at once.
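
A minimal sketch of that block-wise approach, under some assumptions: the
file is line-oriented text, the function name find_matches and the
block_size argument are made up for illustration, and the offsets reported
are 1-based byte positions within each line (not grep-style offsets from
the start of the file).  useBytes = TRUE asks gregexpr for byte positions
rather than character positions.

```r
# Sketch only: scan a text file in blocks of lines and return one row per
# match, giving the line number, the byte offset within that line, and the
# match length.  `find_matches` and `block_size` are hypothetical names.
find_matches <- function(path, pattern, block_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  results <- list()
  lines_seen <- 0L
  repeat {
    # Read the next block; an open connection keeps its position between calls
    block <- readLines(con, n = block_size, warn = FALSE)
    if (length(block) == 0L) break
    hits <- gregexpr(pattern, block, useBytes = TRUE)
    for (i in seq_along(hits)) {
      pos <- hits[[i]]
      if (pos[1L] != -1L) {                  # -1 means no match on this line
        results[[length(results) + 1L]] <- data.frame(
          line   = lines_seen + i,
          offset = as.integer(pos),          # 1-based byte position in the line
          length = attr(pos, "match.length")
        )
      }
    }
    lines_seen <- lines_seen + length(block)
  }
  if (length(results) == 0L) {
    return(data.frame(line = integer(0), offset = integer(0),
                      length = integer(0)))
  }
  do.call(rbind, results)
}
```

Something like find_matches("big.txt", "PATTERN") would then give a data
frame of (line, offset, length) rows.  Growing a list match by match is
simple but not fast; for files with very many matches you would want to
collect results per block instead.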

Duncan Murdoch


> 
> I'd be grateful for any hints on how to do this without a "pipe("grep -buo 
> <PATTERN> <FILENAME>")".
> 
> Thanks, Joh


