[R] Regular Expression help

Gabor Grothendieck ggrothendieck at gmail.com
Sat Nov 8 23:58:05 CET 2008


I'll see if I can speed it up if I get some time.  I personally use it on
relatively short strings, where the low absolute time means that
the higher relative time your comparisons show is not that
important.
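
As a rough sketch of the absolute-versus-relative point, the snippet below
times one call on short strings.  The string length (20), the repetition
count and the helper name split_pq are assumptions made for illustration and
are not from this thread; the strapply timing only runs if gsubfn happens to
be installed, and the numbers will vary by machine.

```r
# Sketch only: short strings (m = 20), same generator idea as the quoted script.
generate <- function(n, m)
    replicate(n, paste(paste(sample(letters[c(1:15, 18:26)], m, replace = TRUE),
                             collapse = ""),
                       sample(letters[16:17], 1), sep = ""))

set.seed(1)
short <- generate(10, 20)

# Hypothetical base R helper: split into p-strings and q-strings.
split_pq <- function(data) {
    is_p <- grepl("^[^pq]*p", data)
    list(p = data[is_p], q = data[!is_p])
}

reps <- 1000
elapsed <- system.time(for (i in seq_len(reps)) split_pq(short))[["elapsed"]]
cat("base R, seconds per call:", elapsed / reps, "\n")

# Optional: the strapply variant from the quoted script, if gsubfn is available.
if (requireNamespace("gsubfn", quietly = TRUE)) {
    elapsed2 <- system.time(for (i in seq_len(reps))
        tapply(short, gsubfn::strapply(short, "^[^pq]*(.)", simplify = c), c)
    )[["elapsed"]]
    cat("strapply, seconds per call:", elapsed2 / reps, "\n")
}
```

If both per-call times come out tiny on strings like these, the ratio between
them matters little in practice.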


On Sat, Nov 8, 2008 at 5:33 PM, Wacek Kusnierczyk
<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
> Gabor Grothendieck wrote:
>> I suspect strapply is only relatively slow on short strings where
>> it doesn't matter anyways since for long strings performance would
>> likely be dominated by the underlying regexp operations.  I know that
>> users are using the package for very long strings since I once had
>> to lift the 25,000 character limit since I had complaints about that.
>> The expressiveness and brevity of strapply (it would be shortest if it
>> were not for the length of the word simplify) offset any disadvantage
>> in my view.
>>
> ok, the attached tests against strings of length 30000 where the
> character that matches is precisely the last one.  (gabor3 is dummy,
> because i had no patience to wait over a minute...)  note that the
> strapply version is still approximately an order of magnitude slower.
>
> with the original script and string length (m) set to 10000, the
> strapply version is two orders of magnitude slower.
>
> it might be that the test is poor, though -- design a smart test where
> strapply wins ;)
> (related to the original problem, of course.)
>
> vQ
>
> generate = function(n, m)
>        replicate(n, paste(paste(sample(letters[c(1:15, 18:26)], m, replace=TRUE), collapse=""), sample(letters[16:17], 1), sep=""))
>
> tests = list(
>
>        wacek =
>        function(data) {
>                p = grep("^[^pq]*p", data)
>                list(p=data[p], q=data[-p])
>        },
>
>        gabor1 =
>        function(data)
>                sapply(c(p="^[^pq]*p", q="^[^pq]*q"), grep, x=data, value=TRUE),
>
>        gabor2 =
>        function(data)
>                tapply(data, sub("^[^pq]*(.).*", "\\1", data), c),
>
>        gabor3 =
>        function(data) 0,
>                # tapply(data, substr(gsub("[^pq]", "", data), 1, 1), c),
>
>        gabor4 =
>        { library(gsubfn); function(data)
>                tapply(data, strapply(data, "^[^pq]*(.)", simplify=c), c) }
> )
>
> data = generate(10,30000)
> for (name in names(tests)) {
>        cat(name, ":\n", sep="")
>        print(system.time(replicate(30,tests[[name]](data)))) }
>
>
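
Before comparing timings it may be worth checking that the variants return
the same groups.  The following is a minimal sketch on four made-up strings,
not part of the attached script; the sub() pattern "^[^pq]*(.).*" is an
assumption about the intended regex, and lapply is used instead of sapply
so that equal-sized groups are not simplified into a matrix.

```r
# Made-up test strings: each contains exactly one p or q, as the last character.
data <- c("abcp", "xyzq", "hellp", "worldq")

# Two greps, one per letter (lapply keeps the result a named list).
by_grep <- lapply(c(p = "^[^pq]*p", q = "^[^pq]*q"), grep, x = data, value = TRUE)

# Extract the first p or q with sub(), then group on it.
by_sub <- tapply(data, sub("^[^pq]*(.).*", "\\1", data), c, simplify = FALSE)

stopifnot(identical(by_grep$p, by_sub[["p"]]),
          identical(by_grep$q, by_sub[["q"]]))

# Pitfall with negative indexing: if nothing matches, grep() returns
# integer(0), and x[-integer(0)] selects nothing rather than everything.
only_q <- c("xyzq", "worldq")
p <- grep("^[^pq]*p", only_q)
length(only_q[-p])                       # 0, although both strings are q-strings

# A logical index avoids the problem.
is_p <- grepl("^[^pq]*p", only_q)
only_q[!is_p]
```

The negative-index case does not arise with the generated data as long as at
least one string ends in p, but it is easy to hit with small samples.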


