[BioC] IRanges: Request for a "step" argument in runsum
Hervé Pagès
hpages at fhcrc.org
Fri Jul 8 10:47:54 CEST 2011
Hi Michael,
On 11-06-29 03:29 PM, Michael Lawrence wrote:
>
>
> 2011/6/28 Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
> Hi Michael, Arnaud,
>
>
> On 11-05-25 08:21 AM, Arnaud Amzallag wrote:
>
> Thank you Michael,
>
> for some email filters reasons I saw your reply only now.
>
> I recon that in my case that would have been much smoother if
> sum() would
> call viewSums() by default and I agree that "It's more intuitive
> to think of
> an RleViews like an RleList rather than an IntegerList.". I
> would support
> that change.
>
>
> I've recently made some changes in that direction in IRanges devel.
> This subtle shift in the nature of an RleViews (from IntegerList
> to RleList) has some deep consequences on the hierarchy of classes
> in IRanges. The most important one is that the Views class doesn't
> extend the IRanges class anymore. Views is now a direct subclass
> of List. This means that you cannot directly manipulate a Views object
> as if it was an IRanges object but you must extract its ranges (with
> ranges()) first.
>
>
> This is pretty unfortunate. We lose the ability to e.g. store the
> coverage inside RangedData (because a ViewsList is a RangesList). Is it
> really necessary for what was proposed?
Because of the inheritance tree. When Views was a subclass of IRanges we
had:
RleViews -> Views -> IRanges -> IntegerList -> List
but [[ on a Views object would not return an integer vector like it does
on an IRanges object (and the reason IRanges is a child of IntegerList
is precisely because of what [[ does on an IRanges object).
So a Views object should not be seen as an IntegerList object, but just
as a List object with an elementType that depends on the particular
type of Views object (e.g. elementType="Rle" for RleViews). Then [[ on
RleViews can return an Rle without contradiction.
H.
>
>
> This is a work-in-progress and a few things still need to be polished.
>
> Also I agree that we should just have "min", "max", "sum", "mean" etc
> methods for Views objects. No need for viewMins, viewMaxs, viewSums etc
> I'll change this.
>
> Cheers,
> H.
>
>
> Also it is possible that before I was summing the values of the
> Rle and did
> not notice the difference because my Rle was made of a lot of
> very short Rle
> lengths.
>
> Arnaud
>
> On Tue, May 10, 2011 at 8:44 AM, Michael
> Lawrence<lawrence.michael at __gene.com
> <mailto:lawrence.michael at gene.com>
>
> wrote:
>
>
> Good to hear that helped. One might expect sum() to simply
> call viewSums(),
> but the semantics are a bit strange here. The reason sum()
> works on Views is
> that a Views is a Ranges and thus an IntegerList (where each
> range encodes a
> sequence of integers). The weird thing is that the elements
> of a Views are
> not the sequence of integers covered but rather the values
> in the Rle. That
> everything works as you expected is just a coincidence of
> dispatch.
>
> For usability we should probably have max(), min(), and
> sum() just use
> viewMaxs, viewMins and viewSums. It's more intuitive to
> think of an RleViews
> like an RleList rather than an IntegerList.
>
>
> On Sun, May 8, 2011 at 1:44 PM, Arnaud
> Amzallag<arnaud.amzallag at __gmail.com
> <mailto:arnaud.amzallag at gmail.com>
>
> wrote:
>
>
> Thank you Michael, the function viewSums was exactly
> what I needed !
>
> 0.014 seconds for viewSums(Views(myrle, ir)) vs 54
> seconds for
> sum(Views(myrle, ir)) on chr22, one sample. I use this
> now instead of of
> runsum, no problem of memory, and probably even faster.
> for full the genome
> on many samples that will surely help. Maybe I should
> have read a bit more
> about the Views.
>
> About the result of runsum, I did see a lot of memory
> usage when I split
> the process with mclapply. The result is indeed a Rle.
> After looking closer,
> the resuting Rle has much more runs that the original
> one. That makes sense,
> because runsum is a kind of smoothing function, and the
> resulting signal has
> much more levels than the original one.
>
> Kind regards,
>
> Arnaud
>
> On May 6, 2011, at 10:42 PM, Michael Lawrence wrote:
>
>
>
> On Fri, May 6, 2011 at 2:54 PM, Arnaud Amzallag<
> arnaud.amzallag at gmail.com
> <mailto:arnaud.amzallag at gmail.com>> wrote:
>
> Dear IRanges developers,
>
> runsum is a very fast and convenient function to
> compute on Rle
> coverages, for instance. However when it is run on
> several chromosomes and
> several samples, it can get very memory intensive.
> For instance on human
> chromosome 1, it outputs a vector of length 250
> millions, so for several
> full genomes it is quickly billions of numbers in
> memory.
>
>
> I would have expected the result to be an Rle, which
> would be fairly
> memory efficient.
>
>
> However, often you don't need a single base
> resolution. I wanted to
> suggest, if it is possible, to add a parameter by
> which one could have the
> sliding window to slide by a user defined step,
> rather than always "step=1",
> as it is now. Such that runsum(myRle, k=1e4, step =
> 1000) would return the
> equivalent of a wig file, for each 10 kilobases of
> the genome, without
> saturating the memory of the server.
>
> I tried with sum(Views(myRle, ir)), it is less
> memory intensive but it is
> much slower. So that amelioration would give the
> best of both worlds, fast
> and memory efficient.
>
>
> Have you tried viewSums(Views(myRle, ir))?
>
>
> kind regards,
>
> Arnaud Amzallag
> Research Fellow
> Mass general Cancer Center / Harvard Medical school
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>
>
>
> [[alternative HTML version deleted]]
>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list