[R] Scanning grep through huge files
murdoch at stats.uwo.ca
Tue Nov 3 15:51:28 CET 2009
On 11/3/2009 9:29 AM, Johannes Graumann wrote:
> I'm dealing with huge files I would like to index. On a linux system "grep
> -buo <PATTERN> <FILENAME>" hands me the byte offsets for "PATTERN" very
> quickly and I am looking to emulate that speed and ease with native R tools
> - for portability and elegance. "gregexpr" should be able to do that but I
> fail to combine it with "scan" or an equivalent to parse the whole file
> without having to read it all into memory.
I think you are going to have to write this yourself. R doesn't have
very many stream oriented functions: almost everything is aimed at
having the whole thing in memory.
You will also have trouble with the byte offsets. The semantics of the
-u option to grep are quite strange (at least according to the man page).
What I'd do given your problem is use readLines to read the file, then
post-process the result of gregexpr to give line and byte offset pairs
for each match; those are more useful in R than the rather bizarre "byte
offsets" that grep -buo will give. But for a huge file you'll probably
have to do this in blocks, as the whole file may be too big.
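A minimal sketch of that blockwise approach might look like the following. The function name grep_offsets and the block_size argument are made up for illustration; only file, readLines and gregexpr are standard R. It assumes the pattern never spans a line break, and it reports 0-based byte offsets within each line rather than offsets from the start of the file.

```r
# Hypothetical helper (not part of base R): scan a file block by block and
# return one row per match with its line number, the 0-based byte offset
# within that line, and the match length in bytes.
grep_offsets <- function(path, pattern, block_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  results <- list()
  lines_done <- 0L                      # lines consumed by earlier blocks
  repeat {
    # On an open connection, readLines continues where the last call stopped
    block <- readLines(con, n = block_size, warn = FALSE)
    if (length(block) == 0L) break
    # useBytes = TRUE makes gregexpr report positions in bytes, not characters
    m <- gregexpr(pattern, block, useBytes = TRUE)
    for (i in seq_along(m)) {
      pos <- m[[i]]
      if (pos[1L] != -1L) {             # -1 means no match on this line
        results[[length(results) + 1L]] <- data.frame(
          line   = lines_done + i,
          offset = as.integer(pos) - 1L,
          length = as.integer(attr(pos, "match.length"))
        )
      }
    }
    lines_done <- lines_done + length(block)
  }
  if (length(results) == 0L) {
    return(data.frame(line = integer(0), offset = integer(0),
                      length = integer(0)))
  }
  do.call(rbind, results)
}

# Small demonstration on a temporary file, using a deliberately tiny block
tmp <- tempfile()
writeLines(c("foo bar", "no match", "barbar"), tmp)
res <- grep_offsets(tmp, "bar", block_size = 2L)
print(res)
```

Only one block of lines is held in memory at a time, so block_size trades memory use against the number of readLines calls. Growing the results list match by match is fine for a sketch but would be slow for files with very many matches.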
> I'd be grateful for any hints on how to do this without a "pipe("grep -buo
> <PATTERN> <FILENAME>")".
> Thanks, Joh