[Bioc-devel] Changes in the %in% function for DNAStringSet?

Hervé Pagès hpages at fhcrc.org
Fri May 4 03:56:13 CEST 2012


Hi Nico,

Last week I did some improvements/reorganization of the match(),
%in%, duplicated(), and unique() stuff in 
IRanges/GenomicRanges/Biostrings, and apparently forgot to define the 
"%in%" method
for ANY,Vector. Thanks for the catch!

This is fixed in IRanges 1.15.8. FWIW I also added an "%in%" method
for Vector,ANY so now this works too:

   > DNAStringSet(c("TTGCGA","ATGRCT","ACASTG")) %in% 
c("TTGCGA","ATGGCT","ACACTG")
   [1]  TRUE FALSE FALSE

<rant mode>

It is so sad that we have to redefine "%in%" methods that do exactly
the same thing as base::`%in%`:

   > base::`%in%`
   function (x, table)
   match(x, table, nomatch = 0L) > 0L
   <environment: namespace:base>

just because base::`%in%` cannot dispatch on the appropriate
"match" method. A well-known issue of the way generics, methods
and NAMESPACE interact with each other... but still an unfortunate
one.

</rant mode>

The good news is that we have on our TODO list to explicitly define
the match() and %in% generics in BiocGenerics so there will be an
opportunity to overwrite the "%in%" default method:

   setMethod("%in%", c("ANY", "ANY"), function (x, table) match(x, 
table, nomatch = 0L) > 0L)

(I'm still hesitant about this though. What could be the drawbacks
of overwriting the default method?)

Also last week at the same time I did the changes on match() and
family, I also reimplemented the "match" method for DNAStringSet
objects (which is called when either 'x' or 'table' or both are
DNAStringSet). The new implementation is in Biostrings 2.25.3.
It uses a hash-based algorithm instead of the quicksort-based algo
that was used so far. The resulting speedup varies a lot depending
on the sizes of 'x' and 'table', and will typically be important
(10x or more) for big (i.e. > 1M elements) DNAStringSet objects.

This benefits directly %in%, duplicated() and unique() on
DNAStringSet objects.

With Biostrings 2.25.3 (Bioc 2.11):

   > library(Biostrings)
   > probes <- DNAStringSet(hgu133aprobe)
   > system.time(isdup <- duplicated(probes))
      user  system elapsed
     0.048   0.000   0.050

With Bioc <= 2.10:

   > system.time(isdup <- duplicated(probes))
      user  system elapsed
     0.232   0.000   0.233

Finally I should mention that, even though the hash function I use
for DNAStringSet (and RNAStringSet, AAStringSet, BStringSet) is the
same as the function used in base R for hashing the strings of a
standard character vector, calling match(), %in%, duplicated() or
unique() on a standard character vector is still slightly faster
(2x) than on a DNAStringSet. This can probably be explained by the
fact that all the strings in all the character vectors defined in
a session are pre-hashed i.e. hashed the 1st time the string is
created and the result of the hash stored in the "global CHARSXP
hash table".

Cheers,
H.

On 05/01/2012 05:28 AM, Nicolas Delhomme wrote:
> Hi all,
>
> In R 2.15.0, Bioc 2.10, the following works:
>
> library(Biostrings)
> c("TTGCGA","ATGGCT","ACACTG") %in% DNAStringSet(c("TTGCGA","ATGRCT","ACASTG"))
> [1]  TRUE FALSE FALSE
>> sessionInfo()R version 2.15.0 (2012-03-30)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Biostrings_2.24.1  IRanges_1.14.2     BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
> [1] stats4_2.15.0
>
>
> While in Bioc 2.11 it fails:
>
> Error in match(x, table, nomatch = 0L) :
>    'match' requires vector arguments
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Biostrings_2.25.3  IRanges_1.15.7     BiocGenerics_0.3.0
>
> loaded via a namespace (and not attached):
> [1] stats4_2.15.0
>
>
>
> I'd just like to know if that is that a change of API or not. If yes, I'd need to adapt my code that currently fails building.
>
> Cheers,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list