[R] help with regexpr in gsub
Marc Schwartz
marc_schwartz at comcast.net
Thu Jan 18 14:04:15 CET 2007
On Thu, 2007-01-18 at 04:49 +0000, Prof Brian Ripley wrote:
> One thing to watch with experiments like this is that the locale will
> matter. Character operations will be faster in a single-byte locale (as
> used here) than in a variable-byte locale (and I suspect Seth and Marc
> used UTF-8), and the relative speeds may alter. Also, the PCRE regexps
> are often much faster, and 'useBytes' can be much faster with ASCII data
> in UTF-8.
>
> For example:
>
> # R-devel, x86_64 Linux
> library(GO)
> goids <- ls(GOTERM)
> gids <- paste(goids, "ISS", sep=".")
> go.ids <- rep(gids, 10)
> > length(go.ids)
> [1] 205950
>
> # In en_GB (single byte)
>
> > system.time(z <- gsub("[.].*", "", go.ids))
>    user  system elapsed
>   1.709   0.004   1.716
> > system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE))
>    user  system elapsed
>   0.241   0.004   0.246
>
> > system.time(z <- gsub('\\..+$','', go.ids))
>    user  system elapsed
>   2.254   0.018   2.286
> > system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
>    user  system elapsed
>   2.890   0.002   2.895
> > system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
>    user  system elapsed
>   2.716   0.002   2.721
> > system.time(z <- sub("\\..+", "", go.ids))
>    user  system elapsed
>   1.724   0.001   1.725
> > system.time(z <- substr(go.ids, 0, 10))
>    user  system elapsed
>   0.084   0.000   0.084
>
> # in en_GB.utf8
>
> > system.time(z <- gsub("[.].*", "", go.ids))
>    user  system elapsed
>   1.689   0.020   1.712
> > system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE))
>    user  system elapsed
>   0.718   0.017   0.736
> > system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE, useBytes=TRUE))
>    user  system elapsed
>   0.243   0.001   0.244
>
> > system.time(z <- gsub('\\..+$','', go.ids))
>    user  system elapsed
>   2.509   0.024   2.537
> > system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
>    user  system elapsed
>   3.772   0.004   3.779
> > system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
>    user  system elapsed
>   4.088   0.007   4.099
> > system.time(z <- sub("\\..+", "", go.ids))
>    user  system elapsed
>   1.920   0.004   1.927
> > system.time(z <- substr(go.ids, 0, 10))
>    user  system elapsed
>   0.096   0.002   0.098
>
> substr still wins, but by a much smaller margin.
<snip>
Just to confirm Prof. Ripley's suspicion: I am indeed running in
en_US.UTF-8.
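
For anyone who wants to check their own session before comparing timings, something along these lines should report the locale in use. This is only a quick sketch using base R's Sys.getlocale() and l10n_info(); the variable name 'info' is just for illustration.

```r
# Character-type locale for the current session, e.g. "en_US.UTF-8"
ctype <- Sys.getlocale("LC_CTYPE")
print(ctype)

# l10n_info() summarises the encoding of the current locale
info <- l10n_info()
print(info[["UTF-8"]])  # TRUE when running in a UTF-8 locale
print(info$MBCS)        # TRUE when the locale is multi-byte
```

If the "UTF-8" element is TRUE, the slower multi-byte timings quoted above are the ones to compare against.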
Thanks for taking the time to point this out.
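
In case it is useful for repeating the comparison without the GO package installed, here is a minimal sketch. The identifiers are synthetic stand-ins that I am assuming have the same shape as the real ones (10 characters followed by ".ISS"), so the timings will not match the figures quoted above. It also checks that the regex approaches and substr() agree before timing them.

```r
# Synthetic stand-ins for GO identifiers (assumed format, not real GO data)
ids <- sprintf("GO:%07d", 1:20000)
go.ids <- rep(paste(ids, "ISS", sep = "."), 10)

# Four ways of stripping everything from the first "." onwards
z1 <- gsub("[.].*", "", go.ids)                                # default regex engine
z2 <- gsub("[.].*", "", go.ids, perl = TRUE)                   # PCRE
z3 <- gsub("[.].*", "", go.ids, perl = TRUE, useBytes = TRUE)  # PCRE, byte-wise
z4 <- substr(go.ids, 1, 10)                                    # fixed-width prefix

# All approaches should give the same answer on this ASCII data
stopifnot(all(z1 == z2), all(z1 == z3), all(z1 == z4))

# Timings depend on the machine, the R version and the locale
print(system.time(gsub("[.].*", "", go.ids)))
print(system.time(gsub("[.].*", "", go.ids, perl = TRUE)))
print(system.time(gsub("[.].*", "", go.ids, perl = TRUE, useBytes = TRUE)))
print(system.time(substr(go.ids, 1, 10)))
```

Note that substr() only works here because every identifier has the same width; the regex versions do not rely on that.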
Best regards,
Marc