[R] regexpr: R takes very long with non-existent pattern

Andrew Simmons @kw@|mmo @end|ng |rom gm@||@com
Thu May 19 02:45:44 CEST 2022


Hello again,


I think I know why it takes so long. In 'regexpr', the first line is:


if (!is.character(text))
    text <- as.character(text)


so it coerces 'text' to a character vector before it checks the other
arguments' validity (and, because R evaluates arguments lazily, before
'pattern' is even looked up). It does that first for a reason: you
want to pick up the as.character methods at the R level. At the C
level, you might use coerceVector(text, STRSXP), but that will only
dispatch to the internal methods, and won't dispatch to methods set
with S3method() or methods::setMethod("as.character"). So yes, it could
be changed to check the simpler arguments' validity first, but it
would cost some functionality in the process.

I haven't checked to see if there is an as.character method for
classes xml_document or xml_node, but my guess is that the coercion is
extremely slow for an object of the size you're dealing with.
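
The order of evaluation described above can be checked with a small
sketch. It assumes a made-up class 'slow_text' (a stand-in for
something like xml_document, not a real class) whose as.character
method only counts its calls; 'undefined_pattern' deliberately does
not exist. The method is registered with registerS3method() so the
example does not depend on where it is run.

```r
# 'slow_text' is a hypothetical class used only for this illustration.
calls <- 0L
registerS3method("as.character", "slow_text", function(x, ...) {
    calls <<- calls + 1L   # count how often the coercion runs
    x[["value"]]
})

x <- structure(list(value = c("abc", "def")), class = "slow_text")

# 'undefined_pattern' is not defined anywhere, so this call must fail.
res <- tryCatch(
    regexpr(undefined_pattern, x, perl = TRUE),
    error = function(e) conditionMessage(e)
)

calls  # counts coercions that ran before the error was raised
res    # the error message about the missing object
```

If the explanation holds, 'calls' is 1 afterwards: the coercion of
'text' ran in full before the missing pattern was noticed.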

Unfortunate, but not much you can do about it without sacrificing versatility.
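
On the caller's side, one possible workaround is to evaluate the
pattern before anything is handed to regexpr. The following is a
minimal sketch, not a change to base R: 'regexpr_checked' and the
class 'demo_doc' are hypothetical names invented for the example, and
the wrapper is stricter than regexpr itself, which would accept a
non-character pattern and coerce it.

```r
# Hypothetical wrapper: evaluate and validate 'pattern' first, so an
# undefined or invalid pattern fails before 'text' is coerced.
regexpr_checked <- function(pattern, text, ...) {
    force(pattern)  # raises "object ... not found" right away
    stopifnot(is.character(pattern), length(pattern) == 1L, !is.na(pattern))
    regexpr(pattern, text, ...)
}

# Made-up class whose as.character method records that it was called.
coerced <- FALSE
registerS3method("as.character", "demo_doc", function(x, ...) {
    coerced <<- TRUE
    x[["value"]]
})
doc <- structure(list(value = "some text"), class = "demo_doc")

# Undefined pattern: the error is raised before any coercion happens.
msg <- tryCatch(
    regexpr_checked(no_such_pattern, doc, perl = TRUE),
    error = function(e) conditionMessage(e)
)
coerced_on_error <- coerced

# Valid pattern: the call is passed through to regexpr as usual.
hit <- regexpr_checked("text", doc, perl = TRUE)
```

The coercion cost is still paid for a valid pattern; the wrapper only
avoids paying it when the pattern is going to fail anyway.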


Regards,
    Andrew Simmons

On Wed, May 18, 2022 at 5:35 PM Leonard Mada <leo.mada using syonic.eu> wrote:
>
> Dear Andrew,
>
>
> I screwed it a little bit up. The object was not a string vector, but an
> xml object (the original xml with the abstracts).
>
> str(x)
> List of 2
>   $ node:<externalptr>
>   $ doc :<externalptr>
>   - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
>
>
> I pasted the R code for a function but had an error, which stopped the
> parsing of the function. But the next lines were still executed:
>
> npos = regexpr(patt, x, perl=TRUE);
> # Error in regexpr(patt, x, perl = TRUE) : object 'patt' not found
>
>
> Variable x was actually the xml object - my mistake. It still takes 1-2
> minutes to generate the final error.
>
> Is regexpr trying to parse the xml with as.character first (I have not
> checked this)?
>
> It makes more sense to first parse the regular expression.
>
>
> Sincerely,
>
>
> Leonard
>
> On 5/19/2022 3:26 AM, Andrew Simmons wrote:
> > Hello,
> >
> >
> > I tried this myself, something like:
> >
> >
> > dat <- utils::read.csv(
> >      "https://raw.githubusercontent.com/discoleo/R/master/TextMining/Pubmed/Example_Abstracts_Title_Pubmed.csv",
> >      check.names = FALSE
> > )
> >
> >
> > regexpr(patt, dat$Abstract, perl = TRUE)
> > regexpr(patt, dat$Title, perl = TRUE)
> >
> >
> > and I can't reproduce your issue. Mine seems to raise the error within
> > a second or less that object 'patt' does not exist. I'm using R 4.2.0
> > and Windows 11, though that shouldn't be making a difference: if you
> > look at Sys.info(), it's still Windows 10 with a build version of
> > 22000. Don't really know what else to say, have you tried it again
> > since?
> >
> >
> > Regards,
> >      Andrew Simmons
> >
> > On Wed, May 18, 2022 at 5:09 PM Leonard Mada via R-help
> > <r-help using r-project.org> wrote:
> >> Dear R Users,
> >>
> >>
> >> I have run the following command in R:
> >>
> >> # x = larger vector of strings (1200 Pubmed abstracts);
> >> # patt = not defined;
> >> npos = regexpr(patt, x, perl=TRUE);
> >> # Error in regexpr(patt, x, perl = TRUE) : object 'patt' not found
> >>
> >>
> >> The problem:
> >>
> >> R becomes unresponsive and it takes 1-2 minutes to return the error. The
> >> operation completes almost instantaneously with a valid pattern.
> >>
> >> Is there a reason for this behavior?
> >>
> >> Tested with R 4.2.0 on MS Windows 10.
> >>
> >>
> >> I have uploaded a set with 1200 Pubmed abstracts on Github, if anyone
> >> wants to check:
> >>
> >> - see file: Example_Abstracts_Title_Pubmed.csv;
> >>
> >> https://github.com/discoleo/R/tree/master/TextMining/Pubmed
> >>
> >> The variable patt was not defined due to an error: but it took very long
> >> to exit the operation and report the error.
> >>
> >>
> >> Many thanks,
> >>
> >>
> >> Leonard
> >>


