[Rd] proposed changes to RSiteSearch

Romain Francois romain.francois at dbmail.com
Fri May 8 17:11:55 CEST 2009


strapply in package gsubfn brings elegance here:

 > txt <- '<foo>bar</foo>'
 > rx <- "<(.*?)>(.*?)</(.*?)>"
 > strapply( txt, rx, c , perl = TRUE )
[[1]]
[1] "foo" "bar" "foo"

Too bad you pay for it in performance:

 > txt <- rep( '<foo>bar</foo>', 1000 )
 > rx <- "<(.*?)>(.*?)</(.*?)>"
 > system.time( out <- strapply( txt, rx, c , perl = TRUE ) )
   user  system elapsed
  2.923   0.005   3.063
 > system.time( out2 <- sapply( paste('\\', 1:3, sep=''), function(x){
+ gsub(rx, x, txt, perl=TRUE)
+ } ) )
   user  system elapsed
  0.011   0.000   0.011

Not sure what the right trade-off is.
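
One possible middle ground, offered only as a sketch: regexec() and regmatches() return every captured group from a single pass over each string. Both functions were added to base R after this message was written, and perl = TRUE in regexec() is assumed to be available in the R version at hand. The column names are borrowed from the quoted example below. No timings are claimed for it here.

```r
# Sketch only: assumes an R version that provides regexec()/regmatches()
# and accepts perl = TRUE in regexec().
txt <- rep("<foo>bar</foo>", 1000)
rx  <- "<(.*?)>(.*?)</(.*?)>"

# One regex pass per string; each list element holds the full match
# followed by the three captured groups.
m <- regmatches(txt, regexec(rx, txt, perl = TRUE))

# Drop the full match and bind the groups into a 1000 x 3 character matrix.
groups <- t(vapply(m, function(x) x[-1], character(3)))
colnames(groups) <- c("open", "content", "close")

groups[1, ]
```

Wrapping either approach in system.time() on the same txt would show whether it lands closer to strapply or to the three gsub calls.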


Wacek Kusnierczyk wrote:
> Romain Francois wrote:
>   
>>    txt <- grep( '^<tr.*<td align=right.*<a', readLines( url ), value = TRUE )
>>    rx <- '^.*?<a href="(.*?)">(.*?)</a>.*<td>(.*?)</td>.*$'
>>    out <- data.frame(
>>        url = gsub( rx, "\\1", txt ),
>>        group = gsub( rx, "\\2", txt ),
>>        description = gsub( rx, "\\3", txt ),
>>     
>
> looking at this bit of your code, i wonder why gsub is not vectorized
> for the pattern and replacement arguments, although it is for the x
> argument.  the three lines above could be collapsed to just one with a
> vectorized gsub:
>
>     # vectorize over pattern and replacement only; x is passed whole
>     gsubm = function(pattern, replacement, x, ...)
>        mapply(USE.NAMES=FALSE, SIMPLIFY=FALSE,
>            gsub, pattern=pattern, replacement=replacement,
>            MoreArgs=list(x=x, ...))
>
> for example, given the sample data
>
>     txt = '<foo>foo</foo><bar>bar</bar>'
>     rx = '<(.*?)>(.*?)</(.*?)>'
>
> the sequence
>
>     open = gsub(rx, '\\1', txt, perl=TRUE)
>     content = gsub(rx, '\\2', txt, perl=TRUE)
>     close = gsub(rx, '\\3', txt, perl=TRUE)
>
>     print(list(open, content, close))
>    
> could be replaced with
>
>     data = structure(names=c('open', 'content', 'close'),
>         gsubm(rx, paste('\\', 1:3, sep=''), txt, perl=TRUE))
>
>     print(data)
>
> surely, a call to mapply does not improve performance, but a
> source-level fix should not be too difficult;  unfortunately, i can't
> find myself willing to struggle with r sources right now.
>
>
> note also that .*? does not work as a non-greedy .* with the default
> regex engine, e.g.,
>
>     txt = "foo='FOO' bar='BAR'"
>     gsub("(.*?)='(.*?)'", '\\1', txt)
>     # "foo='FOO' bar"
>     gsub("(.*?)='(.*?)'", '\\2', txt)
>     # "BAR"
>
> because the first .*? matches everything up to and exclusive of the
> second, *not* the first, '='.  for a non-greedy match, you'd need pcre
> (and using pcre generally improves performance anyway):
>
>     txt = "foo='FOO' bar='BAR'"
>     gsub("(.*?)='(.*?)'", '\\1', txt, perl=TRUE)
>     # "foo bar"
>     gsub("(.*?)='(.*?)'", '\\2', txt, perl=TRUE)
>     # "FOOBAR"
>
> vQ
>
>
>   
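
On the non-greedy point in the quoted message, a small sketch of an alternative that does not depend on which regex engine is in use: negated character classes stop at the first delimiter without needing a lazy quantifier. The pattern below is an illustration written for this sample string, not something taken from the thread.

```r
txt <- "foo='FOO' bar='BAR'"

# [^= ]* cannot run past a '=' or a space, and [^']* cannot run past a
# quote, so each key='value' pair is matched on its own.
rx <- "([^= ]*)='([^']*)'"

keys <- gsub(rx, "\\1", txt)   # "foo bar"
vals <- gsub(rx, "\\2", txt)   # "FOO BAR"

print(keys)
print(vals)
```

Because the space between the pairs is excluded from the first group, it is left in place in both results.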


-- 
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr


