[R] Large regular expressions
Gabor Grothendieck
ggrothendieck at gmail.com
Mon Jan 26 19:57:32 CET 2009
I am using
> R.version.string # Vista
[1] "R version 2.8.1 Patched (2008-12-26 r47350)"
and the example below also crashed R for me.
On Mon, Jan 26, 2009 at 1:38 PM, Stavros Macrakis <macrakis at alum.mit.edu> wrote:
> Given a vector of reference strings Ref and a vector of test strings
> Test, I would like to find elements of Test which do not contain
> elements of Ref as \b-delimited substrings.
>
> This can be done straightforwardly for length(Ref) < 6000 or so (R
> 2.8.1 Windows) by constructing a pattern like \b(a|b|c)\b, but not for
> larger Refs (see below). The easy workaround for this is to split Ref
> into smaller subsets and test each subset separately. Is there a
> better solution e.g. along the lines of fgrep? My real data have
> length(Ref) == 60000 or more.
>
> -s
>
> -----------------------------
>
> Example
>
> Test <- as.character(floor(runif(2000,1,20000))) # Real data is short phrases
>
> testing <- function(n) {
>   Ref <- as.character(1:n) # Real data is sentences
>   Pat <- paste('\\b(', paste(Ref, collapse="|"), ')\\b', sep='')
>   grep(Pat, Test)
> }
>
> testing(2000) => no problem
>
> However, testing(10000) gives an error message (invalid regular
> expression) and a warning (memory exhausted), and testing(100000)
> crashes R (Process R exited abnormally with code 5).
>
> Using grep(...,perl=TRUE) as suggested in the man page also fails with
> testing(10000), though it gives a more helpful error message (regular
> expression is too large) without crashing the process.
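The workaround mentioned above (splitting Ref into smaller subsets and testing each one separately) could be sketched roughly as follows. This is only an illustration, not a tested solution for the real data: the helper name match_in_chunks and the default chunk_size of 1000 are assumptions, and it presumes the elements of Ref contain no regular expression metacharacters, as in the example where they are plain digits.

```r
# Hypothetical helper: returns a logical vector, TRUE where an element of
# Test contains at least one element of Ref as a \b-delimited substring.
# chunk_size = 1000 is an assumed value, chosen to stay well below the
# roughly 6000-alternative limit reported in the original post.
match_in_chunks <- function(Ref, Test, chunk_size = 1000) {
  hit <- logical(length(Test))
  if (length(Ref) == 0) return(hit)
  starts <- seq(1, length(Ref), by = chunk_size)
  for (s in starts) {
    todo <- which(!hit)            # only re-test strings not yet matched
    if (length(todo) == 0) break
    chunk <- Ref[s:min(s + chunk_size - 1, length(Ref))]
    Pat <- paste("\\b(", paste(chunk, collapse = "|"), ")\\b", sep = "")
    hit[todo[grep(Pat, Test[todo])]] <- TRUE
  }
  hit
}

# Elements of Test that do NOT contain any element of Ref:
# Test[!match_in_chunks(Ref, Test)]
```

Each pattern stays small, so the cost is one grep call per chunk rather than one oversized regular expression; strings that have already matched are skipped in later chunks.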