[R] Large regular expressions
Stavros Macrakis
macrakis at alum.mit.edu
Mon Jan 26 19:38:07 CET 2009
Given a vector of reference strings Ref and a vector of test strings
Test, I would like to find the elements of Test that do not contain
any element of Ref as a \b-delimited substring.
This can be done straightforwardly for length(Ref) < 6000 or so (R
2.8.1 Windows) by constructing a pattern like \b(a|b|c)\b, but not for
larger Refs (see below). The easy workaround for this is to split Ref
into smaller subsets and test each subset separately. Is there a
better solution, e.g. along the lines of fgrep? My real data have
length(Ref) == 60000 or more.
-s
-----------------------------
Example
Test <- as.character(floor(runif(2000, 1, 20000)))  # Real data is short phrases
testing <- function(n) {
  Ref <- as.character(1:n)                           # Real data is sentences
  Pat <- paste('\\b(', paste(Ref, collapse = "|"), ')\\b', sep = '')
  grep(Pat, Test)
}
testing(2000) => no problem
However, testing(10000) gives an error message (invalid regular
expression) and a warning (memory exhausted), and testing(100000)
crashes R (Process R exited abnormally with code 5).
Using grep(...,perl=TRUE) as suggested in the man page also fails with
testing(10000), though it gives a more helpful error message (regular
expression is too large) without crashing the process.
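
The workaround mentioned above (splitting Ref into smaller subsets and
testing each subset separately) might be sketched roughly as follows.
This is only an illustration under stated assumptions: the helper name
match_in_chunks and the chunk_size of 1000 are hypothetical, not from
the original post, and the elements of Ref are assumed to contain no
regex metacharacters (as in the numeric example above); real sentences
would need escaping first.

```r
# Hypothetical helper (not from the original post): match Test against Ref
# in chunks so that no single pattern grows past the regex engine's limits.
match_in_chunks <- function(Ref, Test, chunk_size = 1000) {
  hit <- logical(length(Test))            # becomes TRUE once any chunk matches
  chunks <- split(Ref, ceiling(seq_along(Ref) / chunk_size))
  for (chunk in chunks) {
    # Same pattern shape as in the example: \b(a|b|c)\b
    pat <- paste0("\\b(", paste(chunk, collapse = "|"), ")\\b")
    todo <- !hit                          # skip strings that already matched
    hit[todo] <- grepl(pat, Test[todo])
  }
  hit
}

set.seed(1)
Test <- as.character(floor(runif(2000, 1, 20000)))
Ref  <- as.character(1:10000)
hit  <- match_in_chunks(Ref, Test)
no_match <- Test[!hit]  # elements of Test containing no element of Ref
```

Skipping strings that have already matched keeps later chunks cheaper,
but the loop still compiles one pattern per chunk, so this remains a
workaround rather than an fgrep-style solution.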