[Bioc-devel] Combining Ordinary List of GRanges Optimisation

Hervé Pagès hpages at fhcrc.org
Wed Jan 9 00:39:19 CET 2013


Michael,

According to my timings, here is where c() spends its time on GRanges
objects with 1 meta col:

   - merging the seqinfo: 1.5%
   - combining the seqnames: 5.9%
   - combining the ranges: 12.4%
   - combining the strand: 2%
   - rbinding the mcols: 78.2%

It seems that any additional meta col would have an impact that is only
half the impact of the first meta col.

When doing

   df <- DataFrame(conservation=1:1000000*1e-6)
   rbind(df, df)

most of the time (> 90%) is spent on the following line (line 78,
in IRanges/R/DataFrame-utils.R)

   ans <- do.call(DataFrame, cols)

where 'cols' is a (named) list of length 1 containing a numeric vector
of length 2 millions.

Finally this:

   > system.time(ans <- do.call(DataFrame, cols))
      user  system elapsed
     3.684   0.012   3.701
   > system.time(ans2 <- DataFrame(cols))
      user  system elapsed
     3.828   0.008   3.842
   > system.time(ans3 <- DataFrame(conservation=cols[[1]]))
      user  system elapsed
     0.024   0.032   0.057

   > identical(ans, ans2)
   [1] TRUE
   > identical(ans, ans3)
   [1] TRUE

is intriguing. From a naive perspective, it doesn't sound that

   DataFrame(cols)

should need to do a lot more work than

  DataFrame(conservation=cols[[1]])

but apparently it does. May be there is room for some speed improvements
here...

H.


On 01/08/2013 02:05 PM, Michael Lawrence wrote:
> That GRanges only had one column, so I'm hoping that's not a lot of
> overhead. The merging of the thousands of Seqinfo objects is probably
> the issue. Any way to make that n-ary instead of a Reduce() over a
> binary merge?
>
> Michael
>
>
> On Tue, Jan 8, 2013 at 10:44 AM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Hi Dario,
>
>
>     On 01/06/2013 07:00 PM, Dario Strbenac wrote:
>
>             Are you asking if you can rewrite your code to work faster,
>             or are you asking if the BioC devs need to improve the code
>             to be faster?
>
>
>         I was suggesting that maybe the c function for GRanges could be
>         optimised.
>
>             Another would be manually splitting each GRanges objects
>             into its components: seqnames, IRanges, strand, and
>             metadata. Then concatenate these components and build one
>             big GRanges object.
>
>
>         This approach gives:
>
>              user  system elapsed
>            63.488  11.092  74.786
>
>
>     I think this is more or less what 'do.call(c, blockRanges)' would give
>     you if all your GRanges objects were naked i.e. if they had no meta
>     columns.
>
>
>
>         which by using c was previously:
>
>              user  system elapsed
>         935.770  23.657 961.952
>
>
>     By default c() will also combine the meta columns which can be
>     expensive if you have a lot of them and/or if some of them are
>     complicated objects. You can call c() with 'ignore.mcols=TRUE'
>     if you don't need to propagate the meta columns. Which, in the
>     context of do.call(), translates to something like:
>
>        allRanges <- do.call(c, c(blockRanges, list(ignore.mcols=TRUE)))
>
>     IMPORTANT NOTE, related to this thread on the Bioconductor list:
>
>     https://stat.ethz.ch/__pipermail/bioconductor/2012-__November/049567.html
>     <https://stat.ethz.ch/pipermail/bioconductor/2012-November/049567.html>
>
>     In short: if we ask the R core guys to change the implicit c() generic,
>     my understanding is that it won't be possible to support additional
>     args in "c" methods anymore, like the 'ignore.mcols' arg of the method
>     for GenomicRanges objects. Should take the time to discuss this before
>     I proceed?
>
>     Thanks,
>     H.
>
>
>
>         Thanks for the tip. I now remember using this approach at some
>         time in the past.
>
>         _________________________________________________
>         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>         mailing list
>         https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>     _________________________________________________
>     Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
>     https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>     <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list