[Rd] proposed changes to RSiteSearch
Romain Francois
romain.francois at dbmail.com
Fri May 8 17:11:55 CEST 2009
strapply in package gsubfn brings elegance here:
> txt <- '<foo>bar</foo>'
> rx <- "<(.*?)>(.*?)</(.*?)>"
> strapply( txt, rx, c , perl = T )
[[1]]
[1] "foo" "bar" "foo"
Too bad you pay for it in performance:
> txt <- rep( '<foo>bar</foo>', 1000 )
> rx <- "<(.*?)>(.*?)</(.*?)>"
> system.time( out <- strapply( txt, rx, c , perl = T ) )
   user  system elapsed
  2.923   0.005   3.063
> system.time( out2 <- sapply( paste('\\', 1:3, sep=''), function(x){
+ gsub(rx, x, txt, perl=TRUE)
+ } ) )
   user  system elapsed
  0.011   0.000   0.011
Not sure what the right play is.
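
One possible middle ground, sketched here under the assumption of a recent R in which regexec() accepts perl = TRUE, is to pull out all three groups in a single pass with regmatches() and regexec() from base R. The name out3 is only illustrative, and no timing is claimed for this version; it would need to be measured against the two approaches above.

```r
txt <- rep('<foo>bar</foo>', 1000)
rx  <- "<(.*?)>(.*?)</(.*?)>"

# regexec() records the position of the full match and of each group;
# regmatches() turns those positions into substrings, one vector per input.
m <- regmatches(txt, regexec(rx, txt, perl = TRUE))

# Each element holds the full match followed by the three groups:
# drop the full match and bind the groups into a 1000 x 3 matrix.
out3 <- t(sapply(m, function(x) x[-1]))

dim(out3)
out3[1, ]
```

Wrapping the regmatches() line in system.time() gives a figure comparable to the ones above.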
Wacek Kusnierczyk wrote:
> Romain Francois wrote:
>
>> txt <- grep( '^<tr.*<td align=right.*<a', readLines( url ), value =
>> TRUE )
>> rx <- '^.*?<a href="(.*?)">(.*?)</a>.*<td>(.*?)</td>.*$'
>> out <- data.frame(
>> url = gsub( rx, "\\1", txt ),
>> group = gsub( rx, "\\2", txt ),
>> description = gsub( rx, "\\3", txt ),
>>
>
> looking at this bit of your code, i wonder why gsub is not vectorized
> for the pattern and replacement arguments, although it is for the x
> argument. the three lines above could be collapsed to just one with a
> vectorized gsub:
>
> gsubm = function(pattern, replacement, x, ...)
> mapply(USE.NAMES=FALSE, SIMPLIFY=FALSE,
> gsub, pattern=pattern, replacement=replacement, x=x, ...)
>
> for example, given the sample data
>
> txt = '<foo>foo</foo><bar>bar</bar>'
> rx = '<(.*?)>(.*?)</(.*?)>'
>
> the sequence
>
> open = gsub(rx, '\\1', txt, perl=TRUE)
> content = gsub(rx, '\\2', txt, perl=TRUE)
> close = gsub(rx, '\\3', txt, perl=TRUE)
>
> print(list(open, content, close))
>
> could be replaced with
>
> data = structure(names=c('open', 'content', 'close'),
> gsubm(rx, paste('\\', 1:3, sep=''), txt, perl=TRUE))
>
> print(data)
>
> surely, a call to mapply does not improve performance, but a
> source-level fix should not be too difficult; unfortunately, i can't
> find myself willing to struggle with r sources right now.
>
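
One thing to watch with gsubm as written above: mapply also iterates over x, so a txt with several elements gets paired off element by element with the replacements instead of being passed whole to each gsub call. A minimal sketch of a variant that keeps x fixed, assuming that is the intended behaviour (gsubm2 is a hypothetical name, not an existing function):

```r
# Hypothetical variant of gsubm: x and any extra arguments go through
# MoreArgs, so every gsub call receives the complete vector x, and only
# pattern and replacement are iterated over.
gsubm2 <- function(pattern, replacement, x, ...)
    mapply(gsub, pattern = pattern, replacement = replacement,
           MoreArgs = list(x = x, ...),
           SIMPLIFY = FALSE, USE.NAMES = FALSE)

txt <- rep('<foo>bar</foo>', 4)
rx  <- '<(.*?)>(.*?)</(.*?)>'

data <- structure(gsubm2(rx, paste('\\', 1:3, sep = ''), txt, perl = TRUE),
                  names = c('open', 'content', 'close'))
str(data)
```

Each element of data is then a character vector as long as txt, which is the shape needed to build the data.frame directly.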
>
> note also that .*? does not work as a non-greedy .* with the default
> regex engine, e.g.,
>
> txt = "foo='FOO' bar='BAR'"
> gsub("(.*?)='(.*?)'", '\\1', txt)
> # "foo='FOO' bar"
> gsub("(.*?)='(.*?)'", '\\2', txt)
> # "BAR"
>
> because the first .*? matches everything up to and exclusive of the
> second, *not* the first, '='. for a non-greedy match, you'd need pcre
> (and using pcre generally improves performance anyway):
>
> txt = "foo='FOO' bar='BAR'"
> gsub("(.*?)='(.*?)'", '\\1', txt, perl=TRUE)
> # "foo bar"
> gsub("(.*?)='(.*?)'", '\\2', txt, perl=TRUE)
> # "FOOBAR"
>
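
For this particular kind of input, a negated character class is one way to avoid depending on non-greedy support at all. The sketch below assumes that keys never contain '=' and that values never contain a single quote; rx2 and the result names are only illustrative.

```r
txt <- "foo='FOO' bar='BAR'"

# [^=]* cannot run past the first '=', and [^']* cannot run past the
# closing quote, so greedy matching is safe with either engine.
rx2 <- "([^=]*)='([^']*)'"

keys_default <- gsub(rx2, '\\1', txt)
keys_pcre    <- gsub(rx2, '\\1', txt, perl = TRUE)
vals_default <- gsub(rx2, '\\2', txt)
vals_pcre    <- gsub(rx2, '\\2', txt, perl = TRUE)

# Both engines should agree here.
identical(keys_default, keys_pcre)
identical(vals_default, vals_pcre)
```

Note that the space before bar is matched as part of the second key, so it survives in the \\1 result but is consumed in the \\2 result.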
> vQ
>
>
>
--
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr