[R] help with regexpr in gsub

Thu Jan 18 05:49:07 CET 2007

One thing to watch with experiments like this is that the locale will 
matter.  Character operations will be faster in a single-byte locale (as 
used here) than in a multi-byte locale (and I suspect Seth and Marc 
used UTF-8), and the relative speeds may change.  Also, the PCRE regexps 
(perl=TRUE) are often much faster, and 'useBytes' can be much faster with 
ASCII data in UTF-8.
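
To see which case applies to a given session, something along these lines 
should do.  It is only a sketch: it reports the LC_CTYPE setting and 
whether R regards the locale as multi-byte and/or UTF-8, using 
Sys.getlocale() and l10n_info() from base R.

```r
# Report the character-type locale of the current session.
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() returns a list that includes the logical flags
# MBCS (multi-byte character set in use) and UTF-8.
info <- l10n_info()

cat("LC_CTYPE:  ", ctype, "\n")
cat("multi-byte:", info$MBCS, "\n")
cat("UTF-8:     ", info[["UTF-8"]], "\n")
```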

For example:

# R-devel, x86_64 Linux
library(GO)
goids <- ls(GOTERM)
gids <- paste(goids, "ISS", sep=".")
go.ids <- rep(gids, 10)
> length(go.ids)
[1] 205950

# In en_GB (single byte)

> system.time(z <- gsub("[.].*", "", go.ids))
    user  system elapsed
   1.709   0.004   1.716
> system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE))
    user  system elapsed
   0.241   0.004   0.246

> system.time(z <- gsub('\\..+$','', go.ids))
    user  system elapsed
   2.254   0.018   2.286
> system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
    user  system elapsed
   2.890   0.002   2.895
> system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
    user  system elapsed
   2.716   0.002   2.721
> system.time(z <- sub("\\..+", "", go.ids))
    user  system elapsed
   1.724   0.001   1.725
> system.time(z <- substr(go.ids, 0, 10))
    user  system elapsed
   0.084   0.000   0.084

# in en_GB.utf8

> system.time(z <- gsub("[.].*", "", go.ids))
    user  system elapsed
   1.689   0.020   1.712
> system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE))
    user  system elapsed
   0.718   0.017   0.736
> system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE, useBytes=TRUE))
    user  system elapsed
   0.243   0.001   0.244

> system.time(z <- gsub('\\..+$','', go.ids))
    user  system elapsed
   2.509   0.024   2.537
> system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
    user  system elapsed
   3.772   0.004   3.779
> system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
    user  system elapsed
   4.088   0.007   4.099
> system.time(z <- sub("\\..+", "", go.ids))
    user  system elapsed
   1.920   0.004   1.927
> system.time(z <- substr(go.ids, 0, 10))
    user  system elapsed
   0.096   0.002   0.098

substr still wins, but against the PCRE versions the margin is much 
smaller: roughly 3x, rather than the roughly 20x seen with the default 
regexps.
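
For anyone who wants to repeat the comparison without the GO package, 
here is a minimal sketch.  The ids are synthetic stand-ins built with 
sprintf() (an assumption for illustration, not the real GOTERM keys), and 
the object names are made up.  It first checks that the approaches give 
identical results and then times them; the timings will of course depend 
on the machine, the R version and the locale, so none are quoted.

```r
# Synthetic stand-ins for the GO ids: "GO:0000001.ISS", "GO:0000002.ISS", ...
# (illustrative values only; the originals came from ls(GOTERM))
ids <- sprintf("GO:%07d.ISS", 1:20000)
go.ids <- rep(ids, 10)

# Four ways to strip the dot and everything after it.
z.default <- gsub("[.].*", "", go.ids)                    # default regexps
z.pcre    <- gsub("[.].*", "", go.ids, perl = TRUE)       # PCRE
z.bytes   <- gsub("[.].*", "", go.ids, perl = TRUE,
                  useBytes = TRUE)                        # PCRE, byte-wise
z.substr  <- substr(go.ids, 1, 10)                        # fixed-width prefix

# The timings only mean something if all approaches agree.
stopifnot(identical(z.default, z.pcre),
          identical(z.default, z.bytes),
          identical(z.default, z.substr))

# Time each approach; results are machine- and locale-dependent.
print(system.time(gsub("[.].*", "", go.ids)))
print(system.time(gsub("[.].*", "", go.ids, perl = TRUE)))
print(system.time(gsub("[.].*", "", go.ids, perl = TRUE, useBytes = TRUE)))
print(system.time(substr(go.ids, 1, 10)))
```

Note that useBytes=TRUE is only safe here because the data are pure ASCII, 
and substr only works because the prefix always has exactly 10 characters.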

On Wed, 17 Jan 2007, Kimpel, Mark William wrote:

> Thanks for 6 ways to skin this cat! I am just beginning to learn about
> the power of regular expressions and appreciate the many examples of how
> they can be used in this context. This knowledge will come in handy the
> next time the number of characters is variable both before and after the
> dot. On my machine and for my particular example, however, Seth is
> correct in that substr is by far the fastest. I had forgotten that
> substr is vectorized.
>
> Below is the output of my speed trials and sessionInfo in case anyone is
> curious. I artificially made the go.id vector 10X its normal length to
> magnify differences. I did also check to verify that each solution
> worked as predicted, which they all did.
>
> Thanks again for your generous help, Mark
>
>> length(go.ids)
> [1] 79750
>>     go.ids[1:5]
> [1] "GO:0006091.NA"  "GO:0008104.ISS" "GO:0008104.ISS" "GO:0006091.NA"
> [5] "GO:0006091.NAS"
>>     system.time(z <- gsub("[.].*", "", go.ids))
> [1] 0.47 0.00 0.47   NA   NA
>>     system.time(z <- gsub('\\..+$','', go.ids))
> [1] 0.56 0.00 0.56   NA   NA
>>     system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
> [1] 1.08 0.00 1.09   NA   NA
>>     system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
> [1] 1.03 0.00 1.03   NA   NA
>>     system.time(z <- sub("\\..+", "", go.ids))
> [1] 0.49 0.00 0.48   NA   NA
>>     system.time(z <- substr(go.ids, 0, 10))
> [1] 0.02 0.00 0.01   NA   NA
>> sessionInfo()
> R version 2.4.1 (2006-12-18)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;
> LC_CTYPE=English_United States.1252;
> LC_MONETARY=English_United States.1252;
> LC_NUMERIC=C;
> LC_TIME=English_United States.1252
>
> attached base packages:
> [1] "splines"   "stats"     "graphics"  "grDevices" "datasets"  "utils"
> [7] "tools"     "methods"   "base"
>
> other attached packages:
>         rat2302 xlsReadWritePro          qvalue   affycoretools
>        "1.14.0"         "1.0.6"         "1.8.0"         "1.6.0"
>         biomaRt           RCurl             XML         GOstats
>         "1.8.1"         "0.8-0"         "1.2-0"         "2.0.4"
>        Category      genefilter        survival            KEGG
>         "2.0.3"        "1.12.0"          "2.30"        "1.14.1"
>            RBGL        annotate              GO           graph
>        "1.10.0"        "1.12.1"        "1.14.1"        "1.12.0"
>         RWinEdt           limma            affy          affyio
>         "1.7-5"         "2.9.1"        "1.12.2"         "1.2.0"
>         Biobase
>        "1.12.2"
>
> Mark W. Kimpel MD
>
> -----Original Message-----
> From: Marc Schwartz [mailto:marc_schwartz at comcast.net]
> Sent: Wednesday, January 17, 2007 8:11 PM
> To: Seth Falcon
> Cc: Kimpel, Mark William; r-help at stat.math.ethz.ch
> Subject: Re: [R] help with regexpr in gsub
>
> On Wed, 2007-01-17 at 16:46 -0800, Seth Falcon wrote:
>> "Kimpel, Mark William" <mkimpel at iupui.edu> writes:
>>
>>> I have a very long vector of character strings of the format
>>> "GO:0008104.ISS" and need to strip off the dot and anything that
>>> follows it. There are always 10 characters before the dot. The actual
>>> characters and the number of them after the dot is variable.
>>>
>>> So, I would like to return in the format "GO:0008104" . I could do
>>> this with substr and loop over the entire vector, but I thought there
>>> might be a more elegant (and faster) way to do this.
>>>
>>> I have tried gsub using regular expressions without success. The
>>> code
>>>
>>> gsub(pattern= "\.*?" , replacement="", x=character.vector)
>>
>> I guess you want:
>>
>>     sub("([GO:0-9]+)\\..*$", "\\1", goids)
>>
>> [You don't need gsub here]
>>
>> But I don't understand why you wouldn't want to use substr.  At least
>> for me substr looks to be about 20x faster than sub for this
>> problem...
>>
>>
>>  > library(GO)
>>  > goids = ls(GOTERM)
>>  > gids = paste(goids, "ISS", sep=".")
>>  > gids[1:10]
>>    [1] "GO:0000001.ISS" "GO:0000002.ISS" "GO:0000003.ISS" "GO:0000004.ISS"
>>    [5] "GO:0000006.ISS" "GO:0000007.ISS" "GO:0000009.ISS" "GO:0000010.ISS"
>>    [9] "GO:0000011.ISS" "GO:0000012.ISS"
>>
>>  > system.time(z <- substr(gids, 0, 10))
>>      user  system elapsed
>>     0.008   0.000   0.007
>>  > system.time(z2 <- sub("([GO:0-9]+)\\..*$", "\\1", gids))
>>      user  system elapsed
>>     0.136   0.000   0.134
>
> I think that some of the overhead here in using sub() is due to the
> effective partitioning of the source vector, a more complex regex and
> then just returning the first element.
>
> This can be shortened to:
>
> # Note that I have 12 elements here
>> gids
> [1] "GO:0000001.ISS" "GO:0000002.ISS" "GO:0000003.ISS" "GO:0000004.ISS"
> [5] "GO:0000005.ISS" "GO:0000006.ISS" "GO:0000007.ISS" "GO:0000008.ISS"
> [9] "GO:0000009.ISS" "GO:0000010.ISS" "GO:0000011.ISS" "GO:0000012.ISS"
>
>> system.time(z2 <- sub("\\..+", "", gids))
> [1] 0 0 0 0 0
>
>> z2
> [1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000004" "GO:0000005"
> [6] "GO:0000006" "GO:0000007" "GO:0000008" "GO:0000009" "GO:0000010"
> [11] "GO:0000011" "GO:0000012"
>
>
> Which would appear to be quicker than using substr().
>
> HTH,
>
> Marc Schwartz
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595