[R] help with regexpr in gsub
Prof Brian Ripley
ripley at stats.ox.ac.uk
Thu Jan 18 05:49:07 CET 2007
One thing to watch with experiments like this is that the locale matters.
Character operations will be faster in a single-byte locale (as used here)
than in a variable-byte locale (and I suspect Seth and Marc used UTF-8), and
the relative speeds may change. Also, the PCRE regexps (perl=TRUE) are often
much faster, and useBytes=TRUE can be much faster with ASCII data in a UTF-8
locale.
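Since the timings depend on the kind of locale, it may help to record which one a session is running in before comparing numbers. A minimal sketch, assuming only base R (the variable name `info` is just for illustration):

```r
# l10n_info() reports whether the current locale is multi-byte and/or UTF-8.
info <- l10n_info()

cat("LC_CTYPE:  ", Sys.getlocale("LC_CTYPE"), "\n")
cat("multi-byte:", info$MBCS, "\n")
cat("UTF-8:     ", info[["UTF-8"]], "\n")
```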
For example:
# R-devel, x86_64 Linux
library(GO)
goids <- ls(GOTERM)
gids <- paste(goids, "ISS", sep=".")
go.ids <- rep(gids, 10)
> length(go.ids)
[1] 205950
# In en_GB (single byte)
> system.time(z <- gsub("[.].*", "", go.ids))
user system elapsed
1.709 0.004 1.716
> system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE))
user system elapsed
0.241 0.004 0.246
> system.time(z <- gsub('\\..+$','', go.ids))
user system elapsed
2.254 0.018 2.286
> system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
user system elapsed
2.890 0.002 2.895
> system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
user system elapsed
2.716 0.002 2.721
> system.time(z <- sub("\\..+", "", go.ids))
user system elapsed
1.724 0.001 1.725
> system.time(z <- substr(go.ids, 0, 10))
user system elapsed
0.084 0.000 0.084
# in en_GB.utf8
> system.time(z <- gsub("[.].*", "", go.ids))
user system elapsed
1.689 0.020 1.712
> system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE))
user system elapsed
0.718 0.017 0.736
> system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE, useBytes=TRUE))
user system elapsed
0.243 0.001 0.244
> system.time(z <- gsub('\\..+$','', go.ids))
user system elapsed
2.509 0.024 2.537
> system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
user system elapsed
3.772 0.004 3.779
> system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
user system elapsed
4.088 0.007 4.099
> system.time(z <- sub("\\..+", "", go.ids))
user system elapsed
1.920 0.004 1.927
> system.time(z <- substr(go.ids, 0, 10))
user system elapsed
0.096 0.002 0.098
substr still wins, but by a much smaller margin.
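For anyone wanting to repeat this kind of comparison without the GO package, here is a rough sketch under stated assumptions: the identifiers are synthetic stand-ins built with sprintf(), and the names `n`, `evidence`, `strip`, `results` and `timings` are made up for the illustration. It first checks that the approaches agree, then times each one in whatever locale the session happens to be in. Absolute numbers will differ by machine, R version and locale.

```r
# Synthetic stand-ins for the GO identifiers (assumption: 10 characters
# before the dot, a short evidence code after it).
n <- 20000L
evidence <- rep(c("ISS", "NA", "NAS"), length.out = n)
go.ids <- sprintf("GO:%07d.%s", seq_len(n), evidence)

# Candidate ways of dropping the dot and everything after it.
strip <- list(
  default    = function(x) sub("[.].*", "", x),
  perl       = function(x) sub("[.].*", "", x, perl = TRUE),
  perl_bytes = function(x) sub("[.].*", "", x, perl = TRUE, useBytes = TRUE),
  substr     = function(x) substr(x, 1, 10)
)

# All approaches should give the same answer before timings are compared.
results <- lapply(strip, function(f) f(go.ids))
stopifnot(all(vapply(results, identical, logical(1), results[[1]])))

# Elapsed time of each approach in the current locale.
timings <- vapply(strip,
                  function(f) system.time(f(go.ids))[["elapsed"]],
                  numeric(1))

print(Sys.getlocale("LC_CTYPE"))
print(timings)
```

To compare locales, the same script would be run once in a single-byte locale and once in a UTF-8 one; the substr() variant only applies when the prefix length is fixed, as it is here.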
On Wed, 17 Jan 2007, Kimpel, Mark William wrote:
> Thanks for 6 ways to skin this cat! I am just beginning to learn about
> the power of regular expressions and appreciate the many examples of how
> they can be used in this context. This knowledge will come in handy the
> next time the number of characters is variable both before and after the
> dot. On my machine and for my particular example, however, Seth is
> correct in that substr is by far the fastest. I had forgotten that
> substr is vectorized.
>
> Below is the output of my speed trials and sessionInfo in case anyone is
> curious. I artificially made the go.id vector 10X its normal length to
> magnify differences. I did also check to verify that each solution
> worked as predicted, which they all did.
>
> Thanks again for your generous help, Mark
>
>> length(go.ids)
> [1] 79750
>> go.ids[1:5]
> [1] "GO:0006091.NA" "GO:0008104.ISS" "GO:0008104.ISS" "GO:0006091.NA" "GO:0006091.NAS"
>> system.time(z <- gsub("[.].*", "", go.ids))
> [1] 0.47 0.00 0.47 NA NA
>> system.time(z <- gsub('\\..+$','', go.ids))
> [1] 0.56 0.00 0.56 NA NA
>> system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
> [1] 1.08 0.00 1.09 NA NA
>> system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
> [1] 1.03 0.00 1.03 NA NA
>> system.time(z <- sub("\\..+", "", go.ids))
> [1] 0.49 0.00 0.48 NA NA
>> system.time(z <- substr(go.ids, 0, 10))
> [1] 0.02 0.00 0.01 NA NA
>> sessionInfo()
> R version 2.4.1 (2006-12-18)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> attached base packages:
> [1] "splines" "stats" "graphics" "grDevices" "datasets" "utils" "tools" "methods" "base"
>
> other attached packages:
>         rat2302 xlsReadWritePro          qvalue   affycoretools         biomaRt
>        "1.14.0"         "1.0.6"         "1.8.0"         "1.6.0"         "1.8.1"
>           RCurl             XML         GOstats        Category      genefilter
>         "0.8-0"         "1.2-0"         "2.0.4"         "2.0.3"        "1.12.0"
>        survival            KEGG            RBGL        annotate              GO
>          "2.30"        "1.14.1"        "1.10.0"        "1.12.1"        "1.14.1"
>           graph         RWinEdt           limma            affy          affyio
>        "1.12.0"         "1.7-5"         "2.9.1"        "1.12.2"         "1.2.0"
>         Biobase
>        "1.12.2"
>
> Mark W. Kimpel MD
>
> -----Original Message-----
> From: Marc Schwartz [mailto:marc_schwartz at comcast.net]
> Sent: Wednesday, January 17, 2007 8:11 PM
> To: Seth Falcon
> Cc: Kimpel, Mark William; r-help at stat.math.ethz.ch
> Subject: Re: [R] help with regexpr in gsub
>
> On Wed, 2007-01-17 at 16:46 -0800, Seth Falcon wrote:
>> "Kimpel, Mark William" <mkimpel at iupui.edu> writes:
>>
>>> I have a very long vector of character strings of the format
>>> "GO:0008104.ISS" and need to strip off the dot and anything that follows
>>> it. There are always 10 characters before the dot. The actual characters
>>> and the number of them after the dot is variable.
>>>
>>> So, I would like to return in the format "GO:0008104" . I could do this
>>> with substr and loop over the entire vector, but I thought there might
>>> be a more elegant (and faster) way to do this.
>>>
>>> I have tried gsub using regular expressions without success. The code
>>>
>>> gsub(pattern= "\.*?" , replacement="", x=character.vector)
>>
>> I guess you want:
>>
>> sub("([GO:0-9]+)\\..*$", "\\1", goids)
>>
>> [You don't need gsub here]
>>
>> But I don't understand why you wouldn't want to use substr. At least
>> for me substr looks to be about 20x faster than sub for this
>> problem...
>>
>>
>> > library(GO)
>> > goids = ls(GOTERM)
>> > gids = paste(goids, "ISS", sep=".")
>> > gids[1:10]
>> [1] "GO:0000001.ISS" "GO:0000002.ISS" "GO:0000003.ISS" "GO:0000004.ISS"
>> [5] "GO:0000006.ISS" "GO:0000007.ISS" "GO:0000009.ISS" "GO:0000010.ISS"
>> [9] "GO:0000011.ISS" "GO:0000012.ISS"
>>
>> > system.time(z <- substr(gids, 0, 10))
>> user system elapsed
>> 0.008 0.000 0.007
>> > system.time(z2 <- sub("([GO:0-9]+)\\..*$", "\\1", gids))
>> user system elapsed
>> 0.136 0.000 0.134
>
> I think that some of the overhead here in using sub() is due to the
> effective partitioning of the source vector, a more complex regex and
> then just returning the first element.
>
> This can be shortened to:
>
> # Note that I have 12 elements here
>> gids
> [1] "GO:0000001.ISS" "GO:0000002.ISS" "GO:0000003.ISS" "GO:0000004.ISS"
> [5] "GO:0000005.ISS" "GO:0000006.ISS" "GO:0000007.ISS" "GO:0000008.ISS"
> [9] "GO:0000009.ISS" "GO:0000010.ISS" "GO:0000011.ISS" "GO:0000012.ISS"
>
>> system.time(z2 <- sub("\\..+", "", gids))
> [1] 0 0 0 0 0
>
>> z2
> [1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000004" "GO:0000005"
> [6] "GO:0000006" "GO:0000007" "GO:0000008" "GO:0000009" "GO:0000010"
> [11] "GO:0000011" "GO:0000012"
>
>
> Which would appear to be quicker than using substr().
>
> HTH,
>
> Marc Schwartz
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595