[R] Large regular expressions

Gabor Grothendieck ggrothendieck at gmail.com
Mon Jan 26 19:57:32 CET 2009


I am using

> R.version.string # Vista
[1] "R version 2.8.1 Patched (2008-12-26 r47350)"

and testing(100000) also crashed R for me.

On Mon, Jan 26, 2009 at 1:38 PM, Stavros Macrakis <macrakis at alum.mit.edu> wrote:
> Given a vector of reference strings Ref and a vector of test strings
> Test, I would like to find elements of Test which do not contain
> elements of Ref as \b-delimited substrings.
>
> This can be done straightforwardly for length(Ref) < 6000 or so (R
> 2.8.1 Windows) by constructing a pattern like \b(a|b|c)\b, but not for
> larger Refs (see below).  The easy workaround for this is to split Ref
> into smaller subsets and test each subset separately.  Is there a
> better solution e.g. along the lines of fgrep?  My real data have
> length(Ref) == 60000 or more.
>
>              -s
>
> -----------------------------
>
> Example
>
> Test <- as.character(floor(runif(2000,1,20000)))  # Real data is short phrases
>
> testing <- function(n) {
>      Ref <- as.character(1:n)               # Real data is sentences
>      Pat <- paste('\\b(',paste(Ref,collapse="|"),')\\b',sep='')
>      grep(Pat,Test)
> }
>
> testing(2000) => no problem
>
> However, testing(10000) gives an error message (invalid regular
> expression) and a warning (memory exhausted), and testing(100000)
> crashes R (Process R exited abnormally with code 5).
>
> Using grep(...,perl=TRUE) as suggested in the man page also fails with
> testing(10000), though it gives a more helpful error message (regular
> expression is too large) without crashing the process.
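
[The split-Ref-into-subsets workaround mentioned above can be sketched as follows. This is my own sketch, not code from the thread; the function name grep_chunked and the chunk size of 1000 are arbitrary assumptions, with 1000 chosen to stay well under the pattern-size limit reported above.]

```r
## Sketch of the workaround: split Ref into chunks, compile one small
## alternation pattern per chunk, grep Test against each, and pool the
## matching indices. Chunk size 1000 is an arbitrary assumption.
grep_chunked <- function(ref, test, chunk = 1000) {
  groups <- split(ref, ceiling(seq_along(ref) / chunk))
  hits <- lapply(groups, function(r) {
    pat <- paste("\\b(", paste(r, collapse = "|"), ")\\b", sep = "")
    grep(pat, test)
  })
  sort(unique(unlist(hits)))   # indices of test matching some ref word
}

Test <- as.character(floor(runif(2000, 1, 20000)))
Ref  <- as.character(1:10000)            # too large for a single pattern
matched   <- grep_chunked(Ref, Test)
unmatched <- setdiff(seq_along(Test), matched)   # the elements sought
```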
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
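
[On the question of a better solution "along the lines of fgrep": since Ref entries are matched only as whole \b-delimited words, one regex-free route is to split each Test string into word tokens and look the tokens up in Ref with %in%, so no large pattern is ever built. This is my own sketch, not from the thread; it assumes each Ref entry is a single word, which may not hold for the real sentence data, and the name match_by_token and the \W+ split are my own choices.]

```r
## Regex-free sketch: tokenize each Test element at non-word characters
## (approximating \b boundaries) and test membership in Ref via %in%.
## Assumes Ref entries are single words, not multi-word phrases.
match_by_token <- function(ref, test) {
  tokens <- strsplit(test, "\\W+")
  which(sapply(tokens, function(tok) any(tok %in% ref)))
}
```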