[BioC] Faster way to apply a function than endoapply on 40K GrangesList

Hervé Pagès hpages at fhcrc.org
Thu Aug 11 19:48:11 CEST 2011


Hi Michael,

On 11-08-11 03:56 AM, Michael Lawrence wrote:
>
>
> 2011/8/10 Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
>     Hi,
>
>     Note that the "flank" method for GRanges objects is strand aware and
>     in this context "end" means "3' end". If that's not what you want,
>     just do something like
>
>       start(grl) <- end(grl) + 1L
>       end(grl) <- end(grl) + 10000L
>
>     and you are done.
>
>     But if you really want the "3' end", I can't think of an easy/efficient
>     way, unless you do something *highly* NOT recommended like accessing
>     directly a slot of your GRangesList object:
>
>     <NEVER DO THIS!>
>
>       grl at unlistData <- flank(grl at unlistData, 10000, start=FALSE)
>
>     </NEVER DO THIS!>
>
>     Of course we should provide a "flank" method for GRangesList
>     objects that will do the above so the user doesn't need to use
>     a dangerous/forbidden trick ;-)
>
>
>
> It does sound from his variable names that he is looking for a
> strand-aware method, while the RangesList approach proposed by Martin is
> strand insensitive (but fairly efficient due to the compression). That
> is easy to fix though:
>
> ranges(grl) <- flank(ranges(grl), 10000, start = strand(grl) == "-")
>
> Having such a flank() would be great on GRangesList.
>
> Also, a simple unlist(grl, use.names=FALSE) and then resplitting is a
> way to avoid depending on any internals, without sacrificing too much
> performance.

Yes, that's indeed cleaner. The resplitting can be a little bit tricky
though so I was thinking maybe we should have "relist" methods for
this. This would also allow to propagate all the metadata stuff (names,
metadata, elementMetadata) from the 'skeleton' arg:

     gr <- unlist(grl, use.names=FALSE)
     gr2 <- FUN(gr)  # FUN can be any isomorphism (i.e. endo
                     # that preserves the length)
     grl2 <- relist(gr2, skeleton=grl)

The relist part can be made really fast (as fast as the unlist part).

Cheers,
H.

>
> Michael
>
>
>     H.
>
>
>
>     On 11-08-10 02:48 PM, Martin Morgan wrote:
>
>         On 08/10/2011 12:36 PM, Lakshmanan Iyer wrote:
>
>             Hi,
>             I am trying to add 10K to the end of the intervals in a
>             GRangesList
>             object using endoapply and flank using the comand:
>             p3UTRs10K<- endoapply(p3UTRs,function (z) flank (z,
>             10000,start=FALSE));
>
>
>         maybe along the lines of
>
>         example(GRangesList)
>         ranges(grl) = flank(ranges(grl), 10000, start=FALSE)
>
>         which gives
>
>          > end(grl)
>         CompressedIntegerList of length 3
>         [["gr1"]] 6
>         [["gr2"]] 9 15
>         [["gr3"]] 3 9
>          > ranges(grl) = flank(ranges(grl), 1000, start=FALSE)
>          > start(grl)
>         CompressedIntegerList of length 3
>         [["gr1"]] 7
>         [["gr2"]] 10 16
>         [["gr3"]] 4 10
>          > end(grl)
>         CompressedIntegerList of length 3
>         [["gr1"]] 1006
>         [["gr2"]] 1009 1015
>         [["gr3"]] 1003 1009
>
>         ? Martin
>
>
>             I am writing to find out if someone has a suggestion for a
>             faster way
>             of doing this? With 40000 elements it is taking a long time..
>
>             Here is what I am doing:
>
>             class (p3UTRs)
>             [1] "GRangesList"
>             attr(,"package")
>             [1] "GenomicRanges"
>
>             length (p3UTRs)
>             [1] 41053
>
>             p3UTRs10K<- endoapply(p3UTRs,function (z) flank (z,
>             10000,start=FALSE));
>
>                 sessionInfo()
>
>             R version 2.13.0 (2011-04-13)
>             Platform: x86_64-apple-darwin9.8.0/x86___64 (64-bit)
>
>             locale:
>             [1] en_US.UTF-8/en_US.UTF-8/C/C/__en_US.UTF-8/en_US.UTF-8
>
>             attached base packages:
>             [1] stats graphics grDevices utils datasets methods base
>
>             other attached packages:
>             [1] BSgenome.Mmusculus.UCSC.mm9_1.__3.17 GenomicFeatures_1.4.0
>             [3] leeBamViews_0.99.11 BSgenome_1.20.0
>             [5] Biobase_2.12.1 Rsamtools_1.4.1
>             [7] Biostrings_2.20.0 GenomicRanges_1.4.3
>             [9] IRanges_1.10.0 rtracklayer_1.12.0
>             [11] RCurl_1.5-0 bitops_1.0-4.1
>
>             loaded via a namespace (and not attached):
>             [1] biomaRt_2.8.0 DBI_0.2-5 RSQLite_0.9-4 tools_2.13.0 XML_3.2-0
>
>
>
>             _________________________________________________
>             Bioconductor mailing list
>             Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>             https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>             Search the archives:
>             http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>             <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>     _________________________________________________
>     Bioconductor mailing list
>     Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>     https://stat.ethz.ch/mailman/__listinfo/bioconductor
>     <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>     Search the archives:
>     http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list