[Rd] proposed changes to RSiteSearch
Wacek Kusnierczyk
Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Fri May 8 12:37:26 CEST 2009
Romain Francois wrote:
>
> txt <- grep( '^<tr.*<td align=right.*<a', readLines( url ), value =
> TRUE )
> rx <- '^.*?<a href="(.*?)">(.*?)</a>.*<td>(.*?)</td>.*$'
> out <- data.frame(
> url = gsub( rx, "\\1", txt ),
> group = gsub( rx, "\\2", txt ),
> description = gsub( rx, "\\3", txt ),
looking at this bit of your code, i wonder why gsub is not vectorized
for the pattern and replacement arguments, although it is for the x
argument. the three lines above could be collapsed to just one with a
vectorized gsub:
# vectorize over pattern and replacement only; x and ... are passed whole
gsubm = function(pattern, replacement, x, ...)
   mapply(gsub, pattern=pattern, replacement=replacement,
      MoreArgs=list(x=x, ...), USE.NAMES=FALSE, SIMPLIFY=FALSE)
for example, given the sample data
txt = '<foo>foo</foo><bar>bar</bar>'
rx = '<(.*?)>(.*?)</(.*?)>'
the sequence
open = gsub(rx, '\\1', txt, perl=TRUE)
content = gsub(rx, '\\2', txt, perl=TRUE)
close = gsub(rx, '\\3', txt, perl=TRUE)
print(list(open, content, close))
could be replaced with
data = structure(names=c('open', 'content', 'close'),
   gsubm(rx, paste('\\', 1:3, sep=''), txt, perl=TRUE))
print(data)
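
as a rough sketch of how the same idea might carry over to a multi-element x, as in the quoted data.frame code: the helper name gsub_each and the sample data below are made up for illustration and are not part of base r. each replacement is applied to the whole vector, and the results become columns.

```r
# hypothetical helper (not in base r): apply each replacement in turn
# to the whole character vector x, returning one result per replacement
gsub_each = function(pattern, replacements, x, ...)
   lapply(replacements, function(r) gsub(pattern, r, x, ...))

# made-up sample data: two lines, each holding one tagged element
txt = c('<a>one</a>', '<b>two</b>')
rx = '^<(.*?)>(.*?)</(.*?)>$'

cols = gsub_each(rx, paste('\\', 1:3, sep=''), txt, perl=TRUE)
names(cols) = c('open', 'content', 'close')

# one row per element of txt, one column per backreference
out = as.data.frame(cols, stringsAsFactors=FALSE)
print(out)
```

this still runs the regex engine once per replacement, so it only shortens the source; it should not be expected to run faster than three separate gsub calls.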
surely, a call to mapply does not improve performance, but a
source-level fix should not be too difficult; unfortunately, i can't
find myself willing to struggle with r sources right now.
note also that .*? does not work as a non-greedy .* with the default
regex engine, e.g.,
txt = "foo='FOO' bar='BAR'"
gsub("(.*?)='(.*?)'", '\\1', txt)
# "foo='FOO' bar"
gsub("(.*?)='(.*?)'", '\\2', txt)
# "BAR"
because the first .*? matches everything up to, but not including, the
second '=' rather than the first. for a non-greedy match, you'd need pcre
(and using pcre generally improves performance anyway):
txt = "foo='FOO' bar='BAR'"
gsub("(.*?)='(.*?)'", '\\1', txt, perl=TRUE)
# "foo bar"
gsub("(.*?)='(.*?)'", '\\2', txt, perl=TRUE)
# "FOOBAR" (the second match, " bar='BAR'", includes the leading space)
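
where switching engines is not an option, one common workaround (a sketch, not something proposed in the thread) is to avoid the lazy quantifier altogether and use negated character classes, which cannot run past the delimiter under either engine. the pattern rx2 below is an assumption made for this sample string; it expects keys without spaces, quotes or '=' and values without single quotes.

```r
txt = "foo='FOO' bar='BAR'"

# [^=' ]* stops at the first '=', quote or space; [^']* stops at the
# closing quote, so no lazy quantifier is needed
rx2 = "([^=' ]*)='([^']*)'"

keys = gsub(rx2, '\\1', txt)
values = gsub(rx2, '\\2', txt)

print(keys)
# "foo bar"
print(values)
# "FOO BAR"
```

because the key group excludes the space, the separator between the two pairs is left outside every match and survives in both results.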
vQ