[BioC] Faster way to apply a function than endoapply on 40K GrangesList
Hervé Pagès
hpages at fhcrc.org
Thu Aug 11 19:48:11 CEST 2011
Hi Michael,
On 11-08-11 03:56 AM, Michael Lawrence wrote:
>
>
> 2011/8/10 Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
> Hi,
>
> Note that the "flank" method for GRanges objects is strand aware and
> in this context "end" means "3' end". If that's not what you want,
> just do something like
>
> start(grl) <- end(grl) + 1L
> end(grl) <- end(grl) + 10000L
>
> and you are done.
>
> But if you really want the "3' end", I can't think of an easy/efficient
> way, unless you do something *highly* NOT recommended like accessing
> directly a slot of your GRangesList object:
>
> <NEVER DO THIS!>
>
> grl at unlistData <- flank(grl at unlistData, 10000, start=FALSE)
>
> </NEVER DO THIS!>
>
> Of course we should provide a "flank" method for GRangesList
> objects that will do the above so the user doesn't need to use
> a dangerous/forbidden trick ;-)
>
>
>
> It does sound from his variable names that he is looking for a
> strand-aware method, while the RangesList approach proposed by Martin is
> strand insensitive (but fairly efficient due to the compression). That
> is easy to fix though:
>
> ranges(grl) <- flank(ranges(grl), 10000, start = strand(grl) == "-")
>
> Having such a flank() would be great on GRangesList.
>
> Also, a simple unlist(grl, use.names=FALSE) and then resplitting is a
> way to avoid depending on any internals, without sacrificing too much
> performance.
Yes, that's indeed cleaner. The resplitting can be a little bit tricky
though so I was thinking maybe we should have "relist" methods for
this. This would also allow to propagate all the metadata stuff (names,
metadata, elementMetadata) from the 'skeleton' arg:
gr <- unlist(grl, use.names=FALSE)
gr2 <- FUN(gr) # FUN can be any isomorphism (i.e. endo
# that preserves the length)
grl2 <- relist(gr2, skeleton=grl)
The relist part can be made really fast (as fast as the unlist part).
Cheers,
H.
>
> Michael
>
>
> H.
>
>
>
> On 11-08-10 02:48 PM, Martin Morgan wrote:
>
> On 08/10/2011 12:36 PM, Lakshmanan Iyer wrote:
>
> Hi,
> I am trying to add 10K to the end of the intervals in a
> GRangesList
> object using endoapply and flank using the comand:
> p3UTRs10K<- endoapply(p3UTRs,function (z) flank (z,
> 10000,start=FALSE));
>
>
> maybe along the lines of
>
> example(GRangesList)
> ranges(grl) = flank(ranges(grl), 10000, start=FALSE)
>
> which gives
>
> > end(grl)
> CompressedIntegerList of length 3
> [["gr1"]] 6
> [["gr2"]] 9 15
> [["gr3"]] 3 9
> > ranges(grl) = flank(ranges(grl), 1000, start=FALSE)
> > start(grl)
> CompressedIntegerList of length 3
> [["gr1"]] 7
> [["gr2"]] 10 16
> [["gr3"]] 4 10
> > end(grl)
> CompressedIntegerList of length 3
> [["gr1"]] 1006
> [["gr2"]] 1009 1015
> [["gr3"]] 1003 1009
>
> ? Martin
>
>
> I am writing to find out if someone has a suggestion for a
> faster way
> of doing this? With 40000 elements it is taking a long time..
>
> Here is what I am doing:
>
> class (p3UTRs)
> [1] "GRangesList"
> attr(,"package")
> [1] "GenomicRanges"
>
> length (p3UTRs)
> [1] 41053
>
> p3UTRs10K<- endoapply(p3UTRs,function (z) flank (z,
> 10000,start=FALSE));
>
> sessionInfo()
>
> R version 2.13.0 (2011-04-13)
> Platform: x86_64-apple-darwin9.8.0/x86___64 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/__en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] BSgenome.Mmusculus.UCSC.mm9_1.__3.17 GenomicFeatures_1.4.0
> [3] leeBamViews_0.99.11 BSgenome_1.20.0
> [5] Biobase_2.12.1 Rsamtools_1.4.1
> [7] Biostrings_2.20.0 GenomicRanges_1.4.3
> [9] IRanges_1.10.0 rtracklayer_1.12.0
> [11] RCurl_1.5-0 bitops_1.0-4.1
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.8.0 DBI_0.2-5 RSQLite_0.9-4 tools_2.13.0 XML_3.2-0
>
>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list