[Rd] proposed changes to RSiteSearch
Romain Francois
romain.francois at dbmail.com
Fri May 8 19:25:42 CEST 2009
Philippe Grosjean wrote:
>
> ..............................................<°}))><........
> ) ) ) ) )
> ( ( ( ( ( Prof. Philippe Grosjean
> ) ) ) ) )
> ( ( ( ( ( Numerical Ecology of Aquatic Systems
> ) ) ) ) ) Mons-Hainaut University, Belgium
> ( ( ( ( (
> ..............................................................
>
> Romain Francois wrote:
>> strapply in package gsubfn brings elegance here:
>
> Don't! If you write functions to be used in a package to be included
> somehow in the base or recommended packages, then your package should
> only depend on... base (preferably), or recommended packages themselves!
Definitely.
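For what it's worth, a base-only version of the idea could look roughly like the sketch below. It just wraps the one-gsub-per-group approach timed further down; extract_groups is a made-up helper name, not an existing function, and I anchored the pattern with ^ and $ as an assumption so that each call replaces the whole string.

```r
# Minimal base-R sketch (no gsubfn): one gsub() call per capture group.
# extract_groups is a hypothetical helper name used only for illustration.
extract_groups <- function(rx, txt, n) {
  # Build the backreferences "\\1", "\\2", ..., "\\n"
  refs <- paste("\\", seq_len(n), sep = "")
  # Each gsub() call is vectorized over txt and returns one group per element
  out <- lapply(refs, function(ref) gsub(rx, ref, txt, perl = TRUE))
  # One row per element of txt, one column per capture group
  do.call(cbind, out)
}

txt <- rep("<foo>bar</foo>", 3)
rx  <- "^<(.*?)>(.*?)</(.*?)>$"
groups <- extract_groups(rx, txt, 3)
```

One caveat: gsub returns its input unchanged where the pattern does not match, so non-matching elements of txt would need to be filtered out first (e.g. with grep).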
>
> So, forget about gsubfn, unless it is itself incorporated in base or
> utils.
> Best,
>
> Philippe
>
>> > txt <- '<foo>bar</foo>'
>> > rx <- "<(.*?)>(.*?)</(.*?)>"
>> > strapply( txt, rx, c, perl = TRUE )
>> [[1]]
>> [1] "foo" "bar" "foo"
>>
>> Too bad you have to pay this on performance:
>>
>> > txt <- rep( '<foo>bar</foo>', 1000 )
>> > rx <- "<(.*?)>(.*?)</(.*?)>"
>> > system.time( out <- strapply( txt, rx, c, perl = TRUE ) )
>> user system elapsed
>> 2.923 0.005 3.063
>> > system.time( out2 <- sapply( paste('\\', 1:3, sep=''), function(x){
>> + gsub(rx, x, txt, perl=TRUE)
>> + } ) )
>> user system elapsed
>> 0.011 0.000 0.011
>>
>> Not sure what the right play is
>>
>>
>> Wacek Kusnierczyk wrote:
>>> Romain Francois wrote:
>>>
>>>> txt <- grep( '^<tr.*<td align=right.*<a', readLines( url ), value =
>>>> TRUE )
>>>> rx <- '^.*?<a href="(.*?)">(.*?)</a>.*<td>(.*?)</td>.*$'
>>>> out <- data.frame(
>>>> url = gsub( rx, "\\1", txt ),
>>>> group = gsub( rx, "\\2", txt ),
>>>> description = gsub( rx, "\\3", txt ),
>>>>
>>>
>>> looking at this bit of your code, i wonder why gsub is not vectorized
>>> for the pattern and replacement arguments, although it is for the x
>>> argument. the three lines above could be collapsed to just one with a
>>> vectorized gsub:
>>>
>>> gsubm = function(pattern, replacement, x, ...)
>>> mapply(USE.NAMES=FALSE, SIMPLIFY=FALSE,
>>> gsub, pattern=pattern, replacement=replacement, x=x, ...)
>>>
>>> for example, given the sample data
>>>
>>> txt = '<foo>foo</foo><bar>bar</bar>'
>>> rx = '<(.*?)>(.*?)</(.*?)>'
>>>
>>> the sequence
>>>
>>> open = gsub(rx, '\\1', txt, perl=TRUE)
>>> content = gsub(rx, '\\2', txt, perl=TRUE)
>>> close = gsub(rx, '\\3', txt, perl=TRUE)
>>>
>>> print(list(open, content, close))
>>>
>>> could be replaced with
>>>
>>> data = structure(names=c('open', 'content', 'close'),
>>> gsubm(rx, paste('\\', 1:3, sep=''), txt, perl=TRUE))
>>>
>>> print(data)
>>>
>>> surely, a call to mapply does not improve performance, but a
>>> source-level fix should not be too difficult; unfortunately, i can't
>>> find myself willing to struggle with r sources right now.
>>>
>>>
>>> note also that .*? does not work as a non-greedy .* with the default
>>> regex engine, e.g.,
>>>
>>> txt = "foo='FOO' bar='BAR'"
>>> gsub("(.*?)='(.*?)'", '\\1', txt)
>>> # "foo='FOO' bar"
>>> gsub("(.*?)='(.*?)'", '\\2', txt)
>>> # "BAR"
>>>
>>> because the first .*? matches everything up to and exclusive of the
>>> second, *not* the first, '='. for a non-greedy match, you'd need pcre
>>> (and using pcre generally improves performance anyway):
>>>
>>> txt = "foo='FOO' bar='BAR'"
>>> gsub("(.*?)='(.*?)'", '\\1', txt, perl=TRUE)
>>> # "foo bar"
>>> gsub("(.*?)='(.*?)'", '\\2', txt, perl=TRUE)
>>> # "FOO BAR"
>>>
>>> vQ
>>>
>>>
>>>
>>
>>
>
>
--
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr