[Bioc-sig-seq] `+` for GenomeData and coverage from several lanes

Patrick Aboyoun paboyoun at fhcrc.org
Wed Jul 1 05:01:28 CEST 2009


Simon,
I checked in a speedup for coverage calculations to the IRanges  
package. It should be about 30 - 50% faster now. On my laptop, it took  
around 50 seconds to calculate the coverage of 29 million ranges over  
a 1e8 sequence domain on my laptop. Under the old coverage code, this  
calculation took about 100 seconds.

> suppressMessages(library(IRanges))
> N <- 29e6
> set.seed(0)
> x <- IRanges(start=sample(1e8 - 36, N, replace = TRUE), width = 36)
> system.time(coverage(x, width = 1e8))
    user  system elapsed
  49.747   4.262  54.305
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] IRanges_1.3.28



Patrick


Quoting Simon Anders <anders at ebi.ac.uk>:

> Hi
>
> Patrick Aboyoun wrote:
>> Simon,
>> Could you provide some profiling information to show where the   
>> bottlenecks are?
>
> I don't know if there is really a clear bottleneck. 9 minutes to
> calculate the coverage of 29 mio reads is 20 seconds per mio reads;
> this is probably what the coverage function always needed. So, in the
> code given in my mail, the summing up of the GenomeData objects is just
> awkward but not a performance penalty.
>
>> I am also wondering if I should be building up the
>> functionality for RleList, which could have `+` and other Math
>> operations. We have a lot of classes in the Sequence space and it is
>> not clear yet which classes are going to be part of the winning
>> solution.
>
> I'd say that this is the main issue. I discover new classes every day.
> You just mentioned 'RleList', Michael mentions 'GenomeDataList', and
> Martin has another way to go again.
>
> I'm sorry to say that, at least for me, this has become hopelessly
> confusing, and I imagine that many other users fell the same. You write
> that "it is not clear yet which classes are going to be part of the
> winning solution" and I completely agree that it makes more sense to
> have a few good classes rather than adding functionality to any class
> on demand. So, maybe don't bother with a `+` operation for now.
>
> Best regards
>   Simon



More information about the Bioc-sig-sequencing mailing list