[R] Search within a file
Tuszynski, Jaroslaw W.
JAROSLAW.W.TUSZYNSKI at saic.com
Fri Nov 4 16:53:40 CET 2005
Thanks for the great suggestion. I guess the code you suggested would look
something like this:
fregexpr = function(pattern, filename)
{ # Like gregexpr, but searches a file in chunks instead of a string.
  # Only a single string 'pattern' is allowed.
  # Caveat: a match that straddles a chunk boundary is not found.
  buf.size=1024
  n = file.info(filename)$size
  pos = NULL
  fp = file(filename, "rb")
  for (d in seq(1,n,by=buf.size)) {
    # bytes left from position d (inclusive), capped at buf.size
    m = min(buf.size, n-d+1)
    p = gregexpr(pattern, readChar(fp, m))[[1]]
    # shift positions within the chunk to positions within the file
    if(p[1]>0) pos=c(pos, p+d-1)
  }
  close(fp)
  if (is.null(pos)) pos=-1
  return (pos)
}
> fname = file.path(R.home(),"COPYING")
> fregexpr("right", fname)
[1] 73 1347 1422 1460 1727 1879 1908 1939 3106 3350 4240 5530
[13] 6637 6661 6740 9460 9534 10503 11756 12528 12566 13805 15907 16056
[25] 17053 17681 17813
> gregexpr("right", readChar(fname,file.info(fname)$size))[[1]]
[1] 73 1347 1422 1460 1727 1879 1908 1939 3106 3350 4240 5530
[13] 6637 6661 6740 9460 9534 10503 11756 12528 12566 13805 15907 16056
[25] 17053 17681 17813
attr(,"match.length")
[1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
The function above does what I need. If someone needs a function that fully
parallels gregexpr but operates on files rather than strings, then most of the
work would be in modifying the line "if(p[1]>0) pos=c(pos, p+d-1)" to do
concatenation and addition on lists.
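A minimal sketch of what such an extension might look like, under stated
assumptions: the name fregexpr_fixed and its buf.size argument are hypothetical,
the pattern is treated as a literal string (fixed = TRUE) rather than a regular
expression, and the file is assumed to contain single-byte (ASCII) text. It
keeps the "match.length" attribute and carries the last nchar(pattern)-1
characters of each chunk into the next one, so that a match straddling a chunk
boundary is still found.

```r
# Hypothetical helper, not part of the original post.
# Assumes a literal pattern and single-byte (ASCII) file contents.
fregexpr_fixed <- function(pattern, filename, buf.size = 1024) {
  stopifnot(nchar(pattern) > 0)
  n      <- file.info(filename)$size
  keep   <- nchar(pattern) - 1   # tail length carried into the next chunk
  pos    <- integer(0)
  len    <- integer(0)
  carry  <- ""
  offset <- 0                    # bytes read from the file so far
  fp <- file(filename, "rb")
  on.exit(close(fp))
  while (offset < n) {
    m     <- min(buf.size, n - offset)
    chunk <- paste0(carry, readChar(fp, m, useBytes = TRUE))
    p     <- gregexpr(pattern, chunk, fixed = TRUE, useBytes = TRUE)[[1]]
    if (p[1] > 0) {
      # the chunk's first character sits at file position
      # offset - nchar(carry) + 1
      pos <- c(pos, p + offset - nchar(carry))
      len <- c(len, attr(p, "match.length"))
    }
    offset <- offset + m
    # the carried tail is shorter than the pattern, so no match is counted twice
    carry <- if (keep > 0) substring(chunk, max(1, nchar(chunk) - keep + 1)) else ""
  }
  if (length(pos) == 0) { pos <- -1L; len <- -1L }
  attr(pos, "match.length") <- len
  pos
}
```

Carrying over a fixed-length tail only works because a literal pattern has a
known length; a general regular expression can match arbitrarily long text, so
it would need a different strategy.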
Thanks
Jarek Tuszynski
-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Seth Falcon
Sent: Friday, November 04, 2005 1:44 AM
To: r-help at stat.math.ethz.ch
Subject: Re: [R] Search within a file
On 3 Nov 2005, JAROSLAW.W.TUSZYNSKI at saic.com wrote:
> I am looking for a way to search a file for position of some
> expression, from within R. My current code:
>
> sha1Pos = gregexpr("<sha1>", readChar(filename,
> file.info(filename)$size))[[1]]
>
> Works fine for small files, but text files I will be working with
> might get up to Gb range, so I was trying to accomplish the same
> without loading the whole file into R.
I would think you could use readLines to read in a batch of lines, run
(g)regexpr, and keep track of matches and position.
Create a connection to the file using file() first, and then subsequent
calls to readLines will start where you left off.
But you will need to adjust the position indices returned by gregexpr by how
far into the file you are. Seems very doable.
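The readLines idea described above could be sketched roughly as follows. This
is only an illustration under stated assumptions: the name fregexpr_lines and
its batch argument are hypothetical, lines are assumed to end in a single "\n",
and the text is assumed to be single-byte (ASCII), so that character counts
equal file offsets. A pattern that spans a line break cannot be matched this
way.

```r
# Hypothetical helper, not part of the original thread.
# Assumes "\n" line endings and single-byte (ASCII) text.
fregexpr_lines <- function(pattern, filename, batch = 1000) {
  con <- file(filename, "r")
  on.exit(close(con))
  pos    <- integer(0)
  offset <- 0                      # characters consumed before this batch
  repeat {
    lines <- readLines(con, n = batch)   # continues where the last call stopped
    if (length(lines) == 0) break
    # starts[i] = number of characters in the file before line i
    # (+1 per line accounts for the newline that readLines strips)
    starts <- offset + cumsum(c(0, nchar(lines) + 1))
    hits   <- gregexpr(pattern, lines)
    for (i in seq_along(lines)) {
      p <- hits[[i]]
      if (p[1] > 0) pos <- c(pos, starts[i] + p)
    }
    offset <- starts[length(starts)]
  }
  if (length(pos) == 0) -1 else pos
}
```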
+ seth