[Rd] proposed changes to RSiteSearch

Fri May 8 12:37:26 CEST 2009

Romain Francois wrote:
>
>    txt <- grep( '^<tr.*<td align=right.*<a', readLines( url ), value = TRUE )
>    rx <- '^.*?<a href="(.*?)">(.*?)</a>.*<td>(.*?)</td>.*$'
>    out <- data.frame(
>        url = gsub( rx, "\\1", txt ),
>        group = gsub( rx, "\\2", txt ),
>        description = gsub( rx, "\\3", txt ),

looking at this bit of your code, i wonder why gsub is not vectorized
for the pattern and replacement arguments, although it is for the x
argument.  the three lines above could be collapsed to just one with a
vectorized gsub:

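    # mapply recycles pattern, replacement and x in parallel (element by
    # element); the arguments in ... are recycled the same way
    gsubm = function(pattern, replacement, x, ...)
       mapply(USE.NAMES=FALSE, SIMPLIFY=FALSE,
           gsub, pattern=pattern, replacement=replacement, x=x, ...)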

for example, given the sample data

    txt = '<foo>foo</foo><bar>bar</bar>'
    rx = '<(.*?)>(.*?)</(.*?)>'

the sequence

    open = gsub(rx, '\\1', txt, perl=TRUE)
    content = gsub(rx, '\\2', txt, perl=TRUE)
    close = gsub(rx, '\\3', txt, perl=TRUE)

    print(list(open, content, close))

could be replaced with

    data = structure(names=c('open', 'content', 'close'),
        gsubm(rx, paste('\\', 1:3, sep=''), txt, perl=TRUE))

    print(data)
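
one caveat when x has more than one element, as in the quoted code: mapply pairs x with the replacements element by element instead of handing the whole vector to each gsub call.  a minimal sketch of a variant that vectorizes over pattern and replacement only;  the name gsubv and the two sample rows are made up for illustration, not taken from the original code:

```r
# hypothetical variant of gsubm: vectorize over pattern and replacement only,
# and pass the whole of x (plus any extra arguments) to every gsub call
gsubv = function(pattern, replacement, x, ...)
    mapply(gsub, pattern = pattern, replacement = replacement,
        MoreArgs = list(x = x, ...), USE.NAMES = FALSE, SIMPLIFY = FALSE)

# made-up sample rows, shaped loosely like the html lines in the quoted code
txt = c('<a href="u1">g1</a><td>d1</td>',
        '<a href="u2">g2</a><td>d2</td>')
rx = '^.*?<a href="(.*?)">(.*?)</a>.*<td>(.*?)</td>.*$'

# one call yields a list of three character vectors, one per backreference
cols = gsubv(rx, paste('\\', 1:3, sep = ''), txt, perl = TRUE)
names(cols) = c('url', 'group', 'description')
out = as.data.frame(cols, stringsAsFactors = FALSE)
print(out)
```

here each list element has the same length as txt, so the result can go straight into a data frame.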

surely, a call to mapply does not improve performance, but a
source-level fix should not be too difficult;  unfortunately, i can't
find myself willing to struggle with r sources right now.

note also that .*? does not work as a non-greedy .* with the default
regex engine, e.g.,

    txt = "foo='FOO' bar='BAR'"
    gsub("(.*?)='(.*?)'", '\\1', txt)
    # "foo='FOO' bar"
    gsub("(.*?)='(.*?)'", '\\2', txt)
    # "BAR"

because the first .*? matches everything up to, but not including, the
second '=', *not* the first.  for a non-greedy match, you'd need pcre
(and using pcre generally improves performance anyway):

    txt = "foo='FOO' bar='BAR'"
    gsub("(.*?)='(.*?)'", '\\1', txt, perl=TRUE)
    # "foo bar"
    gsub("(.*?)='(.*?)'", '\\2', txt, perl=TRUE)
    # "FOO BAR"
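
where a lazy quantifier cannot be relied on, a negated character class gives the same split without one.  a minimal sketch, assuming the names contain no spaces or '=' and the values contain no single quotes (the variable names keys and values are made up here):

```r
txt = "foo='FOO' bar='BAR'"

# [^ =]* cannot run past a space or '=', and [^']* cannot run past a quote,
# so each match stays short without relying on a lazy quantifier
rx = "([^ =]*)='([^']*)'"

keys = gsub(rx, '\\1', txt)
values = gsub(rx, '\\2', txt)

print(keys)
# "foo bar"
print(values)
# "FOO BAR"
```

the same pattern should give the same result with perl=TRUE, since it does not depend on how the engine treats .*?.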

vQ