[Rd] proposed changes to RSiteSearch

Philippe Grosjean phgrosjean at sciviews.org
Fri May 8 18:07:06 CEST 2009


..............................................<°}))><........
  ) ) ) ) )
( ( ( ( (    Prof. Philippe Grosjean
  ) ) ) ) )
( ( ( ( (    Numerical Ecology of Aquatic Systems
  ) ) ) ) )   Mons-Hainaut University, Belgium
( ( ( ( (
..............................................................

Romain Francois wrote:
> strapply in package gsubfn brings elegance here:

Don't! If you write functions for a package that is meant to be included 
in the base or recommended packages, then your package should depend 
only on... base (preferably), or on recommended packages themselves!

So, forget about gsubfn, unless it is itself incorporated in base or utils.
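
For illustration only, here is a minimal sketch of what a base-only
alternative to the strapply() call quoted below might look like. It assumes
a current R in which regexec() and regmatches() are available in base and
regexec() accepts perl = TRUE; extract_groups() is a hypothetical helper
name, not an existing function.

```r
# Base-only sketch: return the capture groups of the first match in each
# element of x. extract_groups() is a hypothetical name for illustration.
extract_groups <- function(pattern, x, ...) {
  # regexec() records the positions of the full match and of each group;
  # regmatches() turns those positions into the matched substrings.
  m <- regmatches(x, regexec(pattern, x, ...))
  # Drop the first element of each result, which is the full match.
  lapply(m, function(g) g[-1])
}

txt <- "<foo>bar</foo>"
rx <- "<(.*?)>(.*?)</(.*?)>"

groups <- extract_groups(rx, txt, perl = TRUE)
print(groups)
```

Unlike strapply(), this only looks at the first match per string, so it is
not a drop-in replacement, but it needs nothing outside base.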
Best,

Philippe

>  > txt <- '<foo>bar</foo>'
>  > rx <- "<(.*?)>(.*?)</(.*?)>"
>  > strapply( txt, rx, c , perl = TRUE )
> [[1]]
> [1] "foo" "bar" "foo"
> 
> Too bad you pay for this in performance:
> 
>  > txt <- rep( '<foo>bar</foo>', 1000 )
>  > rx <- "<(.*?)>(.*?)</(.*?)>"
>  > system.time( out <- strapply( txt, rx, c , perl = TRUE ) )
>   user  system elapsed
>  2.923   0.005   3.063
>  > system.time( out2 <- sapply( paste('\\', 1:3, sep=''), function(x){
> + gsub(rx, x, txt, perl=TRUE)
> + } ) )
>   user  system elapsed
>  0.011   0.000   0.011
> 
> Not sure what the right play is
> 
> 
> Wacek Kusnierczyk wrote:
>> Romain Francois wrote:
>>  
>>>    txt <- grep( '^<tr.*<td align=right.*<a', readLines( url ), value = TRUE )
>>>    rx <- '^.*?<a href="(.*?)">(.*?)</a>.*<td>(.*?)</td>.*$'
>>>    out <- data.frame(
>>>        url = gsub( rx, "\\1", txt ),
>>>        group = gsub( rx, "\\2", txt ),
>>>        description = gsub( rx, "\\3", txt ),
>>>     
>>
>> looking at this bit of your code, i wonder why gsub is not vectorized
>> for the pattern and replacement arguments, although it is for the x
>> argument.  the three lines above could be collapsed to just one with a
>> vectorized gsub:
>>
>>     gsubm = function(pattern, replacement, x, ...)
>>        mapply(USE.NAMES=FALSE, SIMPLIFY=FALSE,
>>            gsub, pattern=pattern, replacement=replacement, x=x, ...)
>>
>> for example, given the sample data
>>
>>     txt = '<foo>foo</foo><bar>bar</bar>'
>>     rx = '<(.*?)>(.*?)</(.*?)>'
>>
>> the sequence
>>
>>     open = gsub(rx, '\\1', txt, perl=TRUE)
>>     content = gsub(rx, '\\2', txt, perl=TRUE)
>>     close = gsub(rx, '\\3', txt, perl=TRUE)
>>
>>     print(list(open, content, close))
>>
>> could be replaced with
>>
>>     data = structure(names=c('open', 'content', 'close'),
>>         gsubm(rx, paste('\\', 1:3, sep=''), txt, perl=TRUE))
>>
>>     print(data)
>>
>> surely, a call to mapply does not improve performance, but a
>> source-level fix should not be too difficult;  unfortunately, i can't
>> find myself willing to struggle with r sources right now.
>>
>>
>> note also that .*? does not work as a non-greedy .* with the default
>> regex engine, e.g.,
>>
>>     txt = "foo='FOO' bar='BAR'"
>>     gsub("(.*?)='(.*?)'", '\\1', txt)
>>     # "foo='FOO' bar"
>>     gsub("(.*?)='(.*?)'", '\\2', txt)
>>     # "BAR"
>>
>> because the first .*? matches everything up to and exclusive of the
>> second, *not* the first, '='.  for a non-greedy match, you'd need pcre
>> (and using pcre generally improves performance anyway):
>>
>>     txt = "foo='FOO' bar='BAR'"
>>     gsub("(.*?)='(.*?)'", '\\1', txt, perl=TRUE)
>>     # "foo bar"
>>     gsub("(.*?)='(.*?)'", '\\2', txt, perl=TRUE)
>>     # "FOO BAR"
>>
>> vQ


