[Bioc-devel] Combining Ordinary List of GRanges Optimisation

Ryan C. Thompson rct at thompsonclan.org
Mon Jan 7 18:18:08 CET 2013


With such a huge difference, I would wonder if the "c" method for 
GRanges objects is doing N-1 pairwise merges instead of a single N-way 
merge.

On Mon 07 Jan 2013 09:08:28 AM PST, Michael Lawrence wrote:
> Would be interesting to do some profiling. Could be the merging of the
> sequence info, or the rbind on the meta DataFrames (even with one column,
> could be some overhead here).
>
> Michael
>
>
> On Mon, Jan 7, 2013 at 12:31 AM, Hahne, Florian
> <florian.hahne at novartis.com>wrote:
>
>> Hi Dario,
>> the most efficient way to transform between list-like structures of
>> GRanges objects and single GRanges is to use the GRangesList class in the
>> first place. Not sure how you came up with your initial list, but assuming
>> that blockRanges is already a GRangesList object, unlist(blockRanges) will
>> give you a unified GRanges object in matters of seconds, even for gigantic
>> GRangesLists. Constructs like do.call are notoriously slow if your list is
>> very long, and the c method for the GRanges class needs to do quite a lot
>> of work on each element (e.g., checking seqnames and elementMetadata)
>> which is not necessary when dealing with GRangesList where certain
>> assumptions are being made up front. Also my understanding of the
>> GRangesList class is that it essentially is a GRanges object with a bunch
>> of pointers attached to it to deal with the individual subsets.
>>
>> Hope that helps,
>> Florian
>> --
>>
>>
>>
>>
>>
>>
>> On 1/7/13 3:00 AM, "Dario Strbenac" <D.Strbenac at garvan.org.au> wrote:
>>
>>> Hello,
>>>
>>> For a not so large list of GRanges:
>>>
>>>> length(blockRanges)
>>> [1] 4029
>>>> class(blockRanges)
>>> [1] "list"
>>>
>>> Which don't have an unreasonable number of elements in them:
>>>
>>>> summary(sapply(blockRanges, length))
>>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>       1     961   20710   55210   77680  759600
>>>
>>> Combining them takes 15 minutes:
>>>
>>>> system.time(allRanges <- do.call(c, blockRanges))
>>> sessionInfo()
>>>    user  system elapsed
>>> 935.770  23.657 961.952
>>>
>>>> head(blockRanges[[1]])
>>> GRanges with 6 ranges and 1 metadata column:
>>>       seqnames         ranges strand | conservation
>>>          <Rle>      <IRanges>  <Rle> |    <numeric>
>>>   [1]     chr1 [10918, 10918]      * |        0.064
>>>   [2]     chr1 [10919, 10919]      * |        0.056
>>>   [3]     chr1 [10920, 10920]      * |        0.064
>>>   [4]     chr1 [10921, 10921]      * |        0.056
>>>   [5]     chr1 [10922, 10922]      * |        0.064
>>>   [6]     chr1 [10923, 10923]      * |        0.064
>>>   ---
>>>   seqlengths:
>>>    chr1
>>>      NA
>>>
>>> Could this code be faster ?
>>>
>>>> sessionInfo()
>>> R version 2.15.2 (2012-10-26)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>
>>> other attached packages:
>>> [1] GenomicRanges_1.10.5 IRanges_1.16.4       BiocGenerics_0.4.0
>>>
>>> --------------------------------------
>>> Dario Strbenac
>>> PhD Student
>>> University of Sydney
>>> Camperdown NSW 2050
>>> Australia
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel



More information about the Bioc-devel mailing list