[Bioc-devel] Combining Ordinary List of GRanges Optimisation
Hervé Pagès
hpages at fhcrc.org
Wed Jan 9 00:39:19 CET 2013
Michael,
According to my timings, here is where c() spends its time on GRanges
objects with 1 meta col:
- merging the seqinfo: 1.5%
- combining the seqnames: 5.9%
- combining the ranges: 12.4%
- combining the strand: 2%
- rbinding the mcols: 78.2%
It seems that any additional meta col would have an impact that is only
half the impact of the first meta col.
When doing
df <- DataFrame(conservation=1:1000000*1e-6)
rbind(df, df)
most of the time (> 90%) is spent on the following line (line 78,
in IRanges/R/DataFrame-utils.R)
ans <- do.call(DataFrame, cols)
where 'cols' is a (named) list of length 1 containing a numeric vector
of length 2 millions.
Finally this:
> system.time(ans <- do.call(DataFrame, cols))
user system elapsed
3.684 0.012 3.701
> system.time(ans2 <- DataFrame(cols))
user system elapsed
3.828 0.008 3.842
> system.time(ans3 <- DataFrame(conservation=cols[[1]]))
user system elapsed
0.024 0.032 0.057
> identical(ans, ans2)
[1] TRUE
> identical(ans, ans3)
[1] TRUE
is intriguing. From a naive perspective, it doesn't sound that
DataFrame(cols)
should need to do a lot more work than
DataFrame(conservation=cols[[1]])
but apparently it does. May be there is room for some speed improvements
here...
H.
On 01/08/2013 02:05 PM, Michael Lawrence wrote:
> That GRanges only had one column, so I'm hoping that's not a lot of
> overhead. The merging of the thousands of Seqinfo objects is probably
> the issue. Any way to make that n-ary instead of a Reduce() over a
> binary merge?
>
> Michael
>
>
> On Tue, Jan 8, 2013 at 10:44 AM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
> Hi Dario,
>
>
> On 01/06/2013 07:00 PM, Dario Strbenac wrote:
>
> Are you asking if you can rewrite your code to work faster,
> or are you asking if the BioC devs need to improve the code
> to be faster?
>
>
> I was suggesting that maybe the c function for GRanges could be
> optimised.
>
> Another would be manually splitting each GRanges objects
> into its components: seqnames, IRanges, strand, and
> metadata. Then concatenate these components and build one
> big GRanges object.
>
>
> This approach gives:
>
> user system elapsed
> 63.488 11.092 74.786
>
>
> I think this is more or less what 'do.call(c, blockRanges)' would give
> you if all your GRanges objects were naked i.e. if they had no meta
> columns.
>
>
>
> which by using c was previously:
>
> user system elapsed
> 935.770 23.657 961.952
>
>
> By default c() will also combine the meta columns which can be
> expensive if you have a lot of them and/or if some of them are
> complicated objects. You can call c() with 'ignore.mcols=TRUE'
> if you don't need to propagate the meta columns. Which, in the
> context of do.call(), translates to something like:
>
> allRanges <- do.call(c, c(blockRanges, list(ignore.mcols=TRUE)))
>
> IMPORTANT NOTE, related to this thread on the Bioconductor list:
>
> https://stat.ethz.ch/__pipermail/bioconductor/2012-__November/049567.html
> <https://stat.ethz.ch/pipermail/bioconductor/2012-November/049567.html>
>
> In short: if we ask the R core guys to change the implicit c() generic,
> my understanding is that it won't be possible to support additional
> args in "c" methods anymore, like the 'ignore.mcols' arg of the method
> for GenomicRanges objects. Should take the time to discuss this before
> I proceed?
>
> Thanks,
> H.
>
>
>
> Thanks for the tip. I now remember using this approach at some
> time in the past.
>
> _________________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> mailing list
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
> _________________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list