[BioC] Why is *ply-ing over a GRangesList much slower than *ply-ing over an IRangesList?
Patrick Aboyoun
paboyoun at fhcrc.org
Wed Aug 25 19:58:36 CEST 2010
Steve,
I haven't profiled the code yet to know what is going on, but I will
address your followup question.
I have a feeling that the GRangesList concept will be growing over
time, and I am not sure what the tipping point will be for changes in
the code to occur. I see two issues related to GRangesList. The first
is its internal storage (as you mentioned), and the second is its
semantics: are the ranges/intervals contained within each of the
elements "grouped", like exons within a transcript, or are they
independent entities, like collections of tracks for a genome
browser?
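To make the storage point concrete, here is a minimal base-R sketch of the idea behind a compressed list. It is a toy model under stated assumptions, not the actual IRanges implementation: all values sit in one flat vector, and each list element is described only by its end offset. The names `flat`, `ends` and `get_element` are hypothetical and exist only for this illustration.

```r
# Toy model of a compressed list: every value is concatenated into one
# flat vector, plus the cumulative end position of each list element.
# Illustration only; IRanges does not necessarily store data this way.
elements <- list(a = c(1L, 5L, 9L), b = integer(0), c = c(2L, 4L))
flat <- unlist(elements, use.names = FALSE)
ends <- cumsum(vapply(elements, length, integer(1)))

# Extracting element i means slicing `flat` and allocating a new vector.
get_element <- function(i) {
  start <- if (i == 1L) 1L else ends[[i - 1L]] + 1L
  flat[seq_len(ends[[i]] - start + 1L) + start - 1L]
}

# Element-by-element loop: one extraction (and allocation) per element.
len_loop <- vapply(seq_along(ends),
                   function(i) length(get_element(i)),
                   integer(1))

# Vectorized alternative: the lengths follow directly from the offsets,
# without materializing any element.
len_vec <- diff(c(0L, unname(ends)))

stopifnot(identical(len_loop, len_vec))
```

In this model the per-element cost depends on how expensive it is to build each extracted object, which may be part of why a list of richer objects loops more slowly. For the sapply(xcripts, length) case in the thread, elementLengths(xcripts) should give the same counts without the per-element extraction, assuming the installed IRanges/GenomicRanges versions provide it for GRangesList.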
Patrick
Quoting Steve Lianoglou <mailinglist.honeypot at gmail.com>:
> Hi Michael,
>
> On Wed, Aug 25, 2010 at 10:21 AM, Michael Lawrence
> <lawrence.michael at gene.com> wrote:
>> My guess is that your GRangesList is compressed, whereas the IRangesList is
>> uncompressed. Extracting an element from a compressed list will be slower
>> due to the compression.
>
> Actually, the IRangesList from the example above is also compressed:
>
> R> is(irl)
> [1] "CompressedIRangesList" "IRangesList" "CompressedList"
> [4] "RangesList" "Sequence" "Annotated"
>
> So I'm not sure that is what's causing the speed difference, right?
>
> I wrote the portion below before I checked whether `irl` was compressed,
> but I'm still curious about it, so I'll keep the question, assuming
> that there is some significant speed difference between iterating
> over compressed and uncompressed lists anyway:
>
> My next question was whether there was any way to have an uncompressed
> GRangesList, so I went poking around the IRanges/GenomicRanges code.
>
> It seems the answer to that is no, since GRangesList extends/contains
> CompressedList ... right?
>
> Would it be (technically) possible to have something like
> CompressedGRangesList and a "normal" GRangesList -- analogous to how
> we currently have an IRangesList and CompressedIRangesList ... or is
> there some other reason that all GRangesList must be CompressedLists?
>
> Thanks,
> -steve
>
>
>>
>> Michael
>>
>> On Tue, Aug 24, 2010 at 7:31 PM, Steve Lianoglou
>> <mailinglist.honeypot at gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Looping with any of the *ply functions (lapply, sapply, seqapply, etc.) seems
>>> to be significantly slower when you are iterating over a GRangesList
>>> vs. an IRangesList:
>>>
>>> R> library(GenomicFeatures)
>>> R> txdb <- loadFeatures(system.file("extdata",
>>>                                     "UCSC_knownGene_sample.sqlite",
>>>                                     package="GenomicFeatures"))
>>> R> xcripts <- transcriptsBy(txdb, 'gene')
>>> R> system.time(l1 <- sapply(xcripts, length))
>>>    user  system elapsed
>>>   2.298   0.003   2.302
>>>
>>> R> irl <- IRangesList(lapply(xcripts, ranges))
>>> R> system.time(l2 <- sapply(irl, length))
>>>    user  system elapsed
>>>   0.047   0.001   0.049
>>>
>>> R> identical(l1, l2)
>>> [1] TRUE
>>>
>>> I was curious if this is known/expected behavior and it's unavoidable, or
>>> .. ?
>>>
>>> Thanks,
>>> -steve
>>>
>>> R> sessionInfo()
>>> R version 2.12.0 Under development (unstable) (2010-08-21 r52791)
>>> Platform: i386-apple-darwin10.4.0/i386 (32-bit)
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] org.Hs.eg.db_2.4.1 RSQLite_0.9-2 DBI_0.2-5
>>> AnnotationDbi_1.11.4
>>> [5] Biobase_2.9.0 GenomicFeatures_1.1.11 GenomicRanges_1.1.20
>>> IRanges_1.7.21
>>>
>>> loaded via a namespace (and not attached):
>>> [1] BSgenome_1.17.6 Biostrings_2.17.29 RCurl_1.4-3 XML_3.1-1
>>> biomaRt_2.5.1
>>> [6] rtracklayer_1.9.7 tools_2.12.0
>>>
>>>
>>> --
>>> Steve Lianoglou
>>> Graduate Student: Computational Systems Biology
>>> | Memorial Sloan-Kettering Cancer Center
>>> | Weill Medical College of Cornell University
>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
>
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact