[Bioc-devel] Changes in the %in% function for DNAStringSet?

Fri May 4 09:31:40 CEST 2012

Hi Hervé,

Thanks a lot for fixing it and for the super detailed description! Learned a lot :-) And thanks for the benchmarking, that's really useful as well!

I can't really think of any drawbacks there, but my %in% usage is certainly limited. What do the R developer guys say about it? Wouldn't it make sense to have it that way in base R?

Cheers,

Nico

On May 4, 2012, at 3:56 AM, Hervé Pagès wrote:

> Hi Nico,
> 
> Last week I did some improvements/reorganization of the match(),
> %in%, duplicated(), and unique() stuff in IRanges/GenomicRanges/Biostrings, and apparently forgot to define the "%in%" method
> for ANY,Vector. Thanks for the catch!
> 
> This is fixed in IRanges 1.15.8. FWIW I also added an "%in%" method
> for Vector,ANY so now this works too:
> 
>  > DNAStringSet(c("TTGCGA","ATGRCT","ACASTG")) %in% c("TTGCGA","ATGGCT","ACACTG")
>  [1]  TRUE FALSE FALSE
> 
> <rant mode>
> 
> It is so sad that we have to redefine "%in%" methods that do exactly
> the same thing as base::`%in%`:
> 
>  > base::`%in%`
>  function (x, table)
>  match(x, table, nomatch = 0L) > 0L
>  <environment: namespace:base>
> 
> just because base::`%in%` cannot dispatch on the appropriate
> "match" method. A well-known issue of the way generics, methods
> and NAMESPACE interact with each other... but still an unfortunate
> one.
> 
> </rant mode>
> 
> The good news is that we have on our TODO list to explicitly define
> the match() and %in% generics in BiocGenerics so there will be an
> opportunity to overwrite the "%in%" default method:
> 
>  setMethod("%in%", c("ANY", "ANY"), function (x, table) match(x, table, nomatch = 0L) > 0L)
> 
> (I'm still hesitant about this though. What could be the drawbacks
> of overwriting the default method?)
> 
> Also last week at the same time I did the changes on match() and
> family, I also reimplemented the "match" method for DNAStringSet
> objects (which is called when either 'x' or 'table' or both are
> DNAStringSet). The new implementation is in Biostrings 2.25.3.
> It uses a hash-based algorithm instead of the quicksort-based algo
> that was used so far. The resulting speedup varies a lot depending
> on the sizes of 'x' and 'table', and will typically be important
> (10x or more) for big (i.e. > 1M elements) DNAStringSet objects.
> 
> This benefits directly %in%, duplicated() and unique() on
> DNAStringSet objects.
> 
> With Biostrings 2.25.3 (Bioc 2.11):
> 
>  > library(Biostrings)
>  > probes <- DNAStringSet(hgu133aprobe)
>  > system.time(isdup <- duplicated(probes))
>     user  system elapsed
>    0.048   0.000   0.050
> 
> With Bioc <= 2.10:
> 
>  > system.time(isdup <- duplicated(probes))
>     user  system elapsed
>    0.232   0.000   0.233
> 
> Finally I should mention that, even though the hash function I use
> for DNAStringSet (and RNAStringSet, AAStringSet, BStringSet) is the
> same as the function used in base R for hashing the strings of a
> standard character vector, calling match(), %in%, duplicated() or
> unique() on a standard character vector is still slightly faster
> (2x) than on a DNAStringSet. This can probably be explained by the
> fact that all the strings in all the character vectors defined in
> a session are pre-hashed i.e. hashed the 1st time the string is
> created and the result of the hash stored in the "global CHARSXP
> hash table".
> 
> Cheers,
> H.
> 
> On 05/01/2012 05:28 AM, Nicolas Delhomme wrote:
>> Hi all,
>> 
>> In R 2.15.0, Bioc 2.10, the following works:
>> 
>> library(Biostrings)
>> c("TTGCGA","ATGGCT","ACACTG") %in% DNAStringSet(c("TTGCGA","ATGRCT","ACASTG"))
>> [1]  TRUE FALSE FALSE
>>> sessionInfo()R version 2.15.0 (2012-03-30)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>> 
>> locale:
>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> other attached packages:
>> [1] Biostrings_2.24.1  IRanges_1.14.2     BiocGenerics_0.2.0
>> 
>> loaded via a namespace (and not attached):
>> [1] stats4_2.15.0
>> 
>> 
>> While in Bioc 2.11 it fails:
>> 
>> Error in match(x, table, nomatch = 0L) :
>>   'match' requires vector arguments
>>> sessionInfo()
>> R version 2.15.0 (2012-03-30)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>> 
>> locale:
>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> other attached packages:
>> [1] Biostrings_2.25.3  IRanges_1.15.7     BiocGenerics_0.3.0
>> 
>> loaded via a namespace (and not attached):
>> [1] stats4_2.15.0
>> 
>> 
>> 
>> I'd just like to know if that is that a change of API or not. If yes, I'd need to adapt my code that currently fails building.
>> 
>> Cheers,
>> 
>> Nico
>> 
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>> 
>> Genome Biology Computational Support
>> 
>> European Molecular Biology Laboratory
>> 
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> 
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany