[R] Large regular expressions

Stavros Macrakis macrakis at alum.mit.edu
Mon Jan 26 19:38:07 CET 2009

Given a vector of reference strings Ref and a vector of test strings
Test, I would like to find elements of Test which do not contain
elements of Ref as \b-delimited substrings.

This can be done straightforwardly for length(Ref) < 6000 or so (R
2.8.1 Windows) by constructing a pattern like \b(a|b|c)\b, but not for
larger Refs (see below).  The easy workaround for this is to split Ref
into smaller subsets and test each subset separately.  Is there a
better solution e.g. along the lines of fgrep?  My real data have
length(Ref) == 60000 or more.
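For concreteness, the subset workaround might be sketched as below. The helper name matches_any and the chunk_size default are illustrative assumptions, not anything built into R. The sketch also assumes the reference strings contain no regular-expression metacharacters, which holds for the digit strings used in the test case but may not hold for real sentences.

```r
# Hypothetical helper (not part of R): returns TRUE for each element of
# `test` that contains at least one element of `ref` as a \b-delimited
# substring. `ref` is tested in chunks so that no single pattern gets too large.
# Assumes `ref` holds no regex metacharacters.
matches_any <- function(test, ref, chunk_size = 1000) {
  hit <- logical(length(test))
  if (length(ref) == 0) return(hit)
  for (s in seq(1, length(ref), by = chunk_size)) {
    chunk <- ref[s:min(s + chunk_size - 1, length(ref))]
    pat <- paste("\\b(", paste(chunk, collapse = "|"), ")\\b", sep = "")
    todo <- which(!hit)                 # skip strings that already matched
    if (length(todo) == 0) break
    hit[todo[grep(pat, test[todo])]] <- TRUE
  }
  hit
}

# Toy data in the spirit of the test case in this post
set.seed(1)
Test <- as.character(floor(runif(2000, 1, 20000)))
Ref <- as.character(1:10000)
unmatched <- Test[!matches_any(Test, Ref)]   # elements of Test matching no Ref
```

Each chunk still costs one pass over the strings that have not matched yet, so this only sidesteps the pattern-size limit; it is not an fgrep-style single pass.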




Test <- as.character(floor(runif(2000,1,20000)))  # Real data is short phrases

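testing <- function(n) {
      Ref <- as.character(1:n)               # Real data is sentences
      Pat <- paste('\\b(',paste(Ref,collapse="|"),')\\b',sep='')
      # The rest of the function is missing from this copy of the post;
      # presumably it returns the elements of Test that do not match Pat:
      Test[!(seq_along(Test) %in% grep(Pat, Test))]
}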

testing(2000) => no problem

However, testing(10000) gives an error message (invalid regular
expression) and a warning (memory exhausted), and testing(100000)
crashes R (Process R exited abnormally with code 5).

Using grep(...,perl=TRUE) as suggested in the man page also fails with
testing(10000), though it gives a more helpful error message (regular
expression is too large) without crashing the process.
