[R] help with regexpr in gsub

Thu Jan 18 06:10:05 CET 2007

Thanks Brian, that advice may help speed up my regexp operations in the
future. The computer science advice offered by those of you who are more
expert is appreciated by us biologists, who work primarily at the level
of bioinformatics. Mark

Mark W. Kimpel MD 

(317) 490-5129 Work, & Mobile

(317) 663-0513 Home (no voice mail please)

1-(317)-536-2730 FAX

-----Original Message-----
From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk] 
Sent: Wednesday, January 17, 2007 11:49 PM
To: Kimpel, Mark William
Cc: marc_schwartz at comcast.net; Seth Falcon; r-help at stat.math.ethz.ch
Subject: Re: [R] help with regexpr in gsub

One thing to watch with experiments like this is that the locale will
matter.  Character operations will be faster in a single-byte locale (as
used here) than in a variable-byte locale (and I suspect Seth and Marc
used UTF-8), and the relative speeds may alter.  Also, the PCRE regexps
are often much faster, and 'useBytes' can be much faster with ASCII data
in UTF-8.
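
Before comparing timings it can help to confirm which kind of locale the
session is running in. The following is a minimal sketch using only base R
(the variable names ctype and info are illustrative, not part of the
original post):

```r
# Report the character-type locale and whether R treats it as
# multibyte and/or UTF-8.
ctype <- Sys.getlocale("LC_CTYPE")
info  <- l10n_info()

cat("LC_CTYPE: ", ctype, "\n")
cat("multibyte:", info$MBCS, "\n")
cat("UTF-8:    ", info[["UTF-8"]], "\n")
```

If the multibyte flag is TRUE, the perl=TRUE and useBytes=TRUE variants
are the ones most likely to be worth timing.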

For example:

# R-devel, x86_64 Linux
library(GO)
goids <- ls(GOTERM)
gids <- paste(goids, "ISS", sep=".")
go.ids <- rep(gids, 10)
> length(go.ids)
[1] 205950

# In en_GB (single byte)

> system.time(z <- gsub("[.].*", "", go.ids))
    user  system elapsed
   1.709   0.004   1.716
> system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE))
    user  system elapsed
   0.241   0.004   0.246

> system.time(z <- gsub('\\..+$','', go.ids))
    user  system elapsed
   2.254   0.018   2.286
> system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
    user  system elapsed
   2.890   0.002   2.895
> system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
    user  system elapsed
   2.716   0.002   2.721
> system.time(z <- sub("\\..+", "", go.ids))
    user  system elapsed
   1.724   0.001   1.725
> system.time(z <- substr(go.ids, 0, 10))
    user  system elapsed
   0.084   0.000   0.084

# in en_GB.utf8

> system.time(z <- gsub("[.].*", "", go.ids))
    user  system elapsed
   1.689   0.020   1.712
> system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE))
    user  system elapsed
   0.718   0.017   0.736
> system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE, useBytes=TRUE))
    user  system elapsed
   0.243   0.001   0.244

> system.time(z <- gsub('\\..+$','', go.ids))
    user  system elapsed
   2.509   0.024   2.537
> system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
    user  system elapsed
   3.772   0.004   3.779
> system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
    user  system elapsed
   4.088   0.007   4.099
> system.time(z <- sub("\\..+", "", go.ids))
    user  system elapsed
   1.920   0.004   1.927
> system.time(z <- substr(go.ids, 0, 10))
    user  system elapsed
   0.096   0.002   0.098

substr still wins, but by a much smaller margin.
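
For anyone without the GO package installed, a comparison of this kind can
be approximated with synthetic data. The sketch below is an illustration
only: the identifiers are made up with sprintf() rather than taken from
GOTERM, the names n, ids, variants, results, agree and timings are
hypothetical, and the timings will differ by machine, locale and R version.
It also checks that all approaches return the same result before timing
them.

```r
# Synthetic stand-in for the GO identifiers (assumed data, not from GO):
# ten characters, a dot, then a suffix of varying length.
n   <- 20000
ids <- sprintf("GO:%07d.%s", seq_len(n), rep_len(c("ISS", "NA", "NAS"), n))

# The approaches to compare, each as a function of the input vector.
variants <- list(
  default    = function(x) sub("[.].*", "", x),
  pcre       = function(x) sub("[.].*", "", x, perl = TRUE),
  pcre_bytes = function(x) sub("[.].*", "", x, perl = TRUE, useBytes = TRUE),
  substr     = function(x) substr(x, 1, 10)
)

# Check that every approach gives the same answer as substr().
results <- lapply(variants, function(f) f(ids))
agree   <- all(vapply(results, identical, logical(1), results$substr))
stopifnot(agree)

# Elapsed time in seconds for each approach.
timings <- sapply(variants, function(f) system.time(f(ids))[["elapsed"]])
print(timings)
```

Note that substr() is only equivalent here because the prefix is always
ten characters long; the regular-expression variants also handle a prefix
of variable length.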

On Wed, 17 Jan 2007, Kimpel, Mark William wrote:

> Thanks for 6 ways to skin this cat! I am just beginning to learn about
> the power of regular expressions and appreciate the many examples of how
> they can be used in this context. This knowledge will come in handy the
> next time the number of characters is variable both before and after the
> dot. On my machine and for my particular example, however, Seth is
> correct in that substr is by far the fastest. I had forgotten that
> substr is vectorized.
>
> Below is the output of my speed trials and sessionInfo in case anyone is
> curious. I artificially made the go.id vector 10X its normal length to
> magnify differences. I did also check to verify that each solution
> worked as predicted, which they all did.
>
> Thanks again for your generous help, Mark
>
> length(go.ids)
> [1] 79750
>>     go.ids[1:5]
> [1] "GO:0006091.NA"  "GO:0008104.ISS" "GO:0008104.ISS" "GO:0006091.NA"
> "GO:0006091.NAS"
>>     system.time(z <- gsub("[.].*", "", go.ids))
> [1] 0.47 0.00 0.47   NA   NA
>>     system.time(z <- gsub('\\..+$','', go.ids))
> [1] 0.56 0.00 0.56   NA   NA
>>     system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
> [1] 1.08 0.00 1.09   NA   NA
>>     system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
> [1] 1.03 0.00 1.03   NA   NA
>>     system.time(z <- sub("\\..+", "", go.ids))
> [1] 0.49 0.00 0.48   NA   NA
>>     system.time(z <- substr(go.ids, 0, 10))
> [1] 0.02 0.00 0.01   NA   NA
>> sessionInfo()
> R version 2.4.1 (2006-12-18)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;
> LC_MONETARY=English_United States.1252;LC_NUMERIC=C;
> LC_TIME=English_United States.1252
>
> attached base packages:
> [1] "splines"   "stats"     "graphics"  "grDevices" "datasets"  "utils"
>     "tools"     "methods"   "base"
>
> other attached packages:
>         rat2302 xlsReadWritePro          qvalue   affycoretools
>        "1.14.0"         "1.0.6"         "1.8.0"         "1.6.0"
>         biomaRt           RCurl             XML         GOstats
>         "1.8.1"         "0.8-0"         "1.2-0"         "2.0.4"
>        Category      genefilter        survival            KEGG
>         "2.0.3"        "1.12.0"          "2.30"        "1.14.1"
>            RBGL        annotate              GO           graph
>        "1.10.0"        "1.12.1"        "1.14.1"        "1.12.0"
>         RWinEdt           limma            affy          affyio
>         "1.7-5"         "2.9.1"        "1.12.2"         "1.2.0"
>         Biobase
>        "1.12.2"
>
> -----Original Message-----
> From: Marc Schwartz [mailto:marc_schwartz at comcast.net]
> Sent: Wednesday, January 17, 2007 8:11 PM
> To: Seth Falcon
> Cc: Kimpel, Mark William; r-help at stat.math.ethz.ch
> Subject: Re: [R] help with regexpr in gsub
>
> On Wed, 2007-01-17 at 16:46 -0800, Seth Falcon wrote:
>> "Kimpel, Mark William" <mkimpel at iupui.edu> writes:
>>
>>> I have a very long vector of character strings of the format
>>> "GO:0008104.ISS" and need to strip off the dot and anything that follows
>>> it. There are always 10 characters before the dot. The actual characters
>>> and the number of them after the dot is variable.
>>>
>>> So, I would like to return in the format "GO:0008104". I could do this
>>> with substr and loop over the entire vector, but I thought there might
>>> be a more elegant (and faster) way to do this.
>>>
>>> I have tried gsub using regular expressions without success. The code
>>>
>>> gsub(pattern= "\.*?" , replacement="", x=character.vector)
>>
>> I guess you want:
>>
>>     sub("([GO:0-9]+)\\..*$", "\\1", goids)
>>
>> [You don't need gsub here]
>>
>> But I don't understand why you wouldn't want to use substr.  At least
>> for me substr looks to be about 20x faster than sub for this
>> problem...
>>
>>
>>  > library(GO)
>>  > goids = ls(GOTERM)
>>  > gids = paste(goids, "ISS", sep=".")
>>  > gids[1:10]
>>    [1] "GO:0000001.ISS" "GO:0000002.ISS" "GO:0000003.ISS" "GO:0000004.ISS"
>>    [5] "GO:0000006.ISS" "GO:0000007.ISS" "GO:0000009.ISS" "GO:0000010.ISS"
>>    [9] "GO:0000011.ISS" "GO:0000012.ISS"
>>
>>  > system.time(z <- substr(gids, 0, 10))
>>      user  system elapsed
>>     0.008   0.000   0.007
>>  > system.time(z2 <- sub("([GO:0-9]+)\\..*$", "\\1", gids))
>>      user  system elapsed
>>     0.136   0.000   0.134
>
> I think that some of the overhead here in using sub() is due to the
> effective partitioning of the source vector, a more complex regex and
> then just returning the first element.
>
> This can be shortened to:
>
> # Note that I have 12 elements here
>> gids
> [1] "GO:0000001.ISS" "GO:0000002.ISS" "GO:0000003.ISS" "GO:0000004.ISS"
> [5] "GO:0000005.ISS" "GO:0000006.ISS" "GO:0000007.ISS" "GO:0000008.ISS"
> [9] "GO:0000009.ISS" "GO:0000010.ISS" "GO:0000011.ISS" "GO:0000012.ISS"
>
>> system.time(z2 <- sub("\\..+", "", gids))
> [1] 0 0 0 0 0
>
>> z2
> [1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000004" "GO:0000005"
> [6] "GO:0000006" "GO:0000007" "GO:0000008" "GO:0000009" "GO:0000010"
> [11] "GO:0000011" "GO:0000012"
>
>
> Which would appear to be quicker than using substr().
>
> HTH,
>
> Marc Schwartz
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595