[Rd] proposed changes to RSiteSearch

Fri May 8 19:25:42 CEST 2009

Philippe Grosjean wrote:
>
> ..............................................<°}))><........
>  ) ) ) ) )
> ( ( ( ( (    Prof. Philippe Grosjean
>  ) ) ) ) )
> ( ( ( ( (    Numerical Ecology of Aquatic Systems
>  ) ) ) ) )   Mons-Hainaut University, Belgium
> ( ( ( ( (
> ..............................................................
>
> Romain Francois wrote:
>> strapply in package gsubfn brings elegance here:
>
> Don't! If you write functions to be used in a package to be included 
> somehow in the base or recommended packages, then your package should 
> only depend on... base (preferably), or the recommended packages themselves!

Definitely.

>
> So, forget about gsubfn, unless it is itself incorporated in base or 
> utils.
> Best,
>
> Philippe
>
>>  > library( gsubfn )  # strapply() comes from gsubfn
>>  > txt <- '<foo>bar</foo>'
>>  > rx <- "<(.*?)>(.*?)</(.*?)>"
>>  > strapply( txt, rx, c , perl = TRUE )
>> [[1]]
>> [1] "foo" "bar" "foo"
>>
>> Too bad you have to pay this on performance:
>>
>>  > txt <- rep( '<foo>bar</foo>', 1000 )
>>  > rx <- "<(.*?)>(.*?)</(.*?)>"
>>  > system.time( out <- strapply( txt, rx, c , perl = TRUE ) )
>>   user  system elapsed
>>  2.923   0.005   3.063
>>  > system.time( out2 <- sapply( paste('\\', 1:3, sep=''), function(x){
>> + gsub(rx, x, txt, perl=TRUE)
>> + } ) )
>>   user  system elapsed
>>  0.011   0.000   0.011
>>
>> Not sure what the right play is
>>
>>
>> Wacek Kusnierczyk wrote:
>>> Romain Francois wrote:
>>>  
>>>>    txt <- grep( '^<tr.*<td align=right.*<a', readLines( url ), value = TRUE )
>>>>      rx <- '^.*?<a href="(.*?)">(.*?)</a>.*<td>(.*?)</td>.*$'
>>>>    out <- data.frame(
>>>>        url = gsub( rx, "\\1", txt ),
>>>>        group = gsub( rx, "\\2", txt ),
>>>>        description = gsub( rx, "\\3", txt ),
>>>>     
>>>
>>> looking at this bit of your code, i wonder why gsub is not vectorized
>>> for the pattern and replacement arguments, although it is for the x
>>> argument.  the three lines above could be collapsed to just one with a
>>> vectorized gsub:
>>>
>>>     gsubm = function(pattern, replacement, x, ...)
>>>        mapply(USE.NAMES=FALSE, SIMPLIFY=FALSE,
>>>            gsub, pattern=pattern, replacement=replacement, x=x, ...)
>>>
>>> for example, given the sample data
>>>
>>>     txt = '<foo>foo</foo><bar>bar</bar>'
>>>     rx = '<(.*?)>(.*?)</(.*?)>'
>>>
>>> the sequence
>>>
>>>     open = gsub(rx, '\\1', txt, perl=TRUE)
>>>     content = gsub(rx, '\\2', txt, perl=TRUE)
>>>     close = gsub(rx, '\\3', txt, perl=TRUE)
>>>
>>>     print(list(open, content, close))
>>>
>>> could be replaced with
>>>
>>>     data = structure(names=c('open', 'content', 'close'),
>>>         gsubm(rx, paste('\\', 1:3, sep=''), txt, perl=TRUE))
>>>
>>>     print(data)
>>>
>>> surely, a call to mapply does not improve performance, but a
>>> source-level fix should not be too difficult;  unfortunately, i can't
>>> find myself willing to struggle with r sources right now.
>>>
>>>
>>> note also that .*? does not work as a non-greedy .* with the default
>>> regex engine, e.g.,
>>>
>>>     txt = "foo='FOO' bar='BAR'"
>>>     gsub("(.*?)='(.*?)'", '\\1', txt)
>>>     # "foo='FOO' bar"
>>>     gsub("(.*?)='(.*?)'", '\\2', txt)
>>>     # "BAR"
>>>
>>> because the first .*? matches everything up to and exclusive of the
>>> second, *not* the first, '='.  for a non-greedy match, you'd need pcre
>>> (and using pcre generally improves performance anyway):
>>>
>>>     txt = "foo='FOO' bar='BAR'"
>>>     gsub("(.*?)='(.*?)'", '\\1', txt, perl=TRUE)
>>>     # "foo bar"
>>>     gsub("(.*?)='(.*?)'", '\\2', txt, perl=TRUE)
>>>     # "FOOBAR"
>>>
>>> vQ
>
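
For what it's worth, here is a rough base-only sketch of the gsub approach
from the timing above, wrapped so that all groups come back in one call.
The name capture_groups and its argument n (the number of groups) are made
up for illustration; nothing like it exists in base or utils. It assumes
the pattern matches each element once; elements that do not match are
returned unchanged by gsub, so anchor the pattern or check with grepl first.

```r
# Hypothetical helper, base R only: one column per capture group.
# `n` is the number of groups in `pattern`; extra arguments go to gsub().
capture_groups <- function(pattern, x, n, ...) {
  refs <- paste0("\\", seq_len(n))   # backreferences "\\1", "\\2", ...
  out <- vapply(refs, function(ref) gsub(pattern, ref, x, ...),
                character(length(x)), USE.NAMES = FALSE)
  # vapply() drops to a plain vector when length(x) == 1, so force a matrix
  matrix(out, nrow = length(x), ncol = n)
}

txt <- rep("<foo>bar</foo>", 3)
rx  <- "^<(.*?)>(.*?)</(.*?)>$"
m   <- capture_groups(rx, txt, 3, perl = TRUE)
m[1, ]
# [1] "foo" "bar" "foo"
```

This still runs gsub once per group, so it is not the source-level
vectorization Wacek has in mind, only a convenience wrapper with no
dependency outside base.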

-- 
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr