[Bioc-devel] [devteam-bioc] Very slow when operate GRangesList

Valerie Obenchain vobencha at fhcrc.org
Tue Aug 27 22:49:29 CEST 2013


Thanks Jianhong for reporting this.

Changes implemented in IRanges 1.19.27:
- RleList() constructor now has default 'compress=TRUE'.
- seqselect,Vector-method lapply() loop was replaced with direct subset.
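
The second change can be pictured with plain vectors. This is only a sketch
in base R, not the IRanges code: 'starts' and 'widths' are hypothetical
stand-ins for start(ir) and width(ir), and both function names are made up
for illustration.

```r
## Hypothetical stand-ins for an IRanges object: plain start/width vectors.
x <- letters
starts <- c(3L, 10L, 20L)
widths <- c(2L, 4L, 1L)

## Loop form: extract one window per range, then concatenate the pieces.
loop_subset <- function(x, starts, widths) {
    pieces <- lapply(seq_along(starts), function(i)
        x[seq.int(starts[i], length.out = widths[i])])
    do.call(c, pieces)
}

## Direct form: expand all ranges to integer positions and subset once.
direct_subset <- function(x, starts, widths) {
    idx <- sequence(widths) + rep.int(starts - 1L, widths)
    x[idx]
}

## Both forms select the same elements.
stopifnot(identical(loop_subset(x, starts, widths),
                    direct_subset(x, starts, widths)))
```

The direct form pays for one index vector whose length is the sum of the
widths, which is the trade-off Michael describes further down the thread.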

New timings:

## generic subset function
fun0 <- function(x) x[500:1]

## GRangesList with RleList as metadata col
grll <- GRanges(seqnames="chr1",
                 IRanges(start=1:500, width=2),
                 someInfo=rep(RleList("*"), 500))
grr <- split(grll, 1:500)
 > microbenchmark(fun0(grr), times=10)
Unit: milliseconds
       expr      min       lq   median      uq      max neval
  fun0(grr) 28.88062 29.31157 30.58494 31.4393 32.26367    10

Median is now 0.031 seconds, compared to the previously reported elapsed time of 1.635 seconds:

>>               > system.time(grr<- grr[500:1])
>>                  user  system elapsed
>>                 1.622   0.013   1.635
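
On an IRanges older than 1.19.27, the workaround suggested earlier in the
thread should give a similar result: ask for the compressed representation
explicitly. A minimal sketch, assuming GenomicRanges and IRanges are
installed; the object names are chosen here for illustration only.

```r
library(GenomicRanges)

## Request a CompressedRleList explicitly instead of relying on the default.
info <- rep(RleList("*", compress=TRUE), 500)

gr <- GRanges(seqnames="chr1",
              ranges=IRanges(start=1:500, width=2),
              someInfo=info)
grl <- split(gr, 1:500)

## Reversing the list order is the operation that was slow in the report.
rev_grl <- grl[500:1]
```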



Valerie


On 08/23/2013 11:17 AM, Michael Lawrence wrote:
>
>
>
> On Fri, Aug 23, 2013 at 8:41 AM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
>
>     Hi Michael,
>
>     Martin and I have been discussing this. In addition to the fix you
>     suggest, what do you think of changing the default to
>     compress=TRUE for the RleList constructor? RleList is the only one
>     of the AtomicLists with default FALSE. Was there a reason for this
>     when it was first implemented?
>
>
> I'm guessing Patrick did that because we always used Rles for coverage,
> and RleList for per-chromosome coverage. Also, there might be some
> overhead in that Rle runs in the unlistData can cross list elements.
>
> About my fix, the only downside would be if the range widths were much
> larger than the size of the vector, e.g., a highly compressed Rle,
> selected with chromosome-size ranges. Then the as.integer(ir) is big
> compared to the data. Otherwise, it's way faster.
>
>
>     Val
>
>
>
>
>     On 08/22/2013 07:34 PM, Maintainer wrote:
>
>         Hi,
>
>         SimpleLists are slow in this situation, basically because the
>         underlying seqselect is slow, due to this loop:
>
>             x <- do.call(c, lapply(seq_len(length(ir)), function(i)
>                 window(x, start = start(ir)[i], width = width(ir)[i])))
>
>         Am I missing something or could this become a simple
>         x[as.integer(ir)]?
>
>         In the meantime, using CompressedLists is the way to go. So for an
>         RleList, you need to pass compress=TRUE to the constructor.
>
>
>         On Wed, Aug 21, 2013 at 8:30 AM, Ou, Jianhong
>         <Jianhong.Ou at umassmed.edu> wrote:
>
>              Hi,
>
>              When I use a big GRangesList, I find that subsetting becomes
>              very slow when the metadata contains an AtomicList, e.g.
>
>               > grll <- GRanges(seqnames="chr1", ranges=IRanges(start=1:500, width=2), someInfo=rep(RleList("*"), 500))
>               > grr <- split(grll, 1:500)
>               > grl <- as.list(grr)
>               > system.time(grl<- grl[500:1])
>                  user  system elapsed
>                     0       0       0
>               > system.time(grr<- grr[500:1])
>                  user  system elapsed
>                 1.622   0.013   1.635
>               > grll <- GRanges(seqnames="chr1", ranges=IRanges(start=1:500, width=2))
>               > grr <- split(grll, 1:500)
>               > grl <- as.list(grr)
>               > system.time(grl<- grl[500:1])
>                  user  system elapsed
>                     0       0       0
>               > system.time(grr<- grr[500:1])
>                  user  system elapsed
>                 0.029   0.001   0.030
>               > sessionInfo()
>              R Under development (unstable) (2013-07-23 r63392)
>              Platform: x86_64-apple-darwin12.4.0 (64-bit)
>
>              locale:
>              [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
>              attached base packages:
>              [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
>              other attached packages:
>              [1] GenomicRanges_1.13.36 XVector_0.1.0         IRanges_1.19.24       BiocGenerics_0.7.3
>
>              loaded via a namespace (and not attached):
>              [1] stats4_3.1.0 tools_3.1.0
>
>              Is there any method to improve this?
>
>              Yours sincerely,
>
>              Jianhong Ou
>
>              LRB 670A
>              Program in Gene Function and Expression
>              364 Plantation Street Worcester,
>              MA 01605
>
>


