[BioC] matrix like object with Rle columns

Jeff Leek jtleek at gmail.com
Wed Jun 27 19:46:38 CEST 2012


I would love this feature, and would use it all the time, if it existed.

Jeff

On Wed, Jun 27, 2012 at 11:21 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
> On Wed, Jun 27, 2012 at 8:07 AM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com> wrote:
>
>> One comment: since a matrix is a vector with a dim attribute, I see that
>> the natural parallel is to do the same for Rle.
>
>
> Right, in the original plan, the Array class would bring the dim attribute,
> and RleMatrix would contain both Matrix and Rle.
>
>
>>  Nevertheless, that
>> would put an upper limit on the number of runLengths in the entire
>> matrix.  My impression (which could be wrong) is that we would need to
>> implement essentially all matrix-like numeric operations from scratch
>> anyway, so it may be worthwhile to consider using a list of Rle's
>> where each Rle is a column, instead of a single Rle to represent all
>> columns.  Clearly that depends on implementation details, but if we
>> really need to do everything from scratch, a list of columns might be
>> more flexible (and perhaps even easier to code).
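
[Illustrative sketch for the list-of-columns idea above. It is written in plain Python rather than R, and it is not the IRanges Rle API: each column is assumed to be a hypothetical (run_values, run_lengths) pair, and row_sums is a made-up helper. It shows one way a row-wise sum could be computed by merging run boundaries, without expanding any column to full length.]

```python
from itertools import accumulate


def row_sums(columns):
    """Row sums of run-length encoded columns, returned run-length encoded.

    columns: list of (run_values, run_lengths) pairs; every column must
    describe the same number of rows and contain no zero-length runs.
    """
    # Exclusive end position of every run, per column.
    ends = [list(accumulate(lengths)) for _, lengths in columns]
    # Any row where at least one column ends a run is a breakpoint.
    breaks = sorted({end for col_ends in ends for end in col_ends})
    cursors = [0] * len(columns)
    out_values, out_lengths = [], []
    start = 0
    for end in breaks:
        total = 0
        for j, (values, _) in enumerate(columns):
            # Advance to the run of column j that covers rows [start, end).
            while ends[j][cursors[j]] <= start:
                cursors[j] += 1
            total += values[cursors[j]]
        if out_values and out_values[-1] == total:
            out_lengths[-1] += end - start  # extend the previous run
        else:
            out_values.append(total)
            out_lengths.append(end - start)
        start = end
    return out_values, out_lengths


# Two columns over 6 rows: dense [0,0,3,3,3,0] and [1,1,1,1,2,2].
col_a = ([0, 3, 0], [2, 3, 1])
col_b = ([1, 2], [4, 2])
sum_values, sum_lengths = row_sums([col_a, col_b])
```

The work done is proportional to the number of runs times the number of columns, not to the number of rows, which is the point of keeping the columns encoded.
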
>>
>>
> This would make it harder to treat RleMatrix as an Rle (which is a nice
> feature of base R matrices). If the problem is the vector length limit,
> then I'd rather wait for Luke's fix, which apparently is coming along.
>
>> Kasper
>>
>> On Tue, Jun 26, 2012 at 6:41 AM, Michael Lawrence
>> <lawrence.michael at gene.com> wrote:
>> > Seems like it could be a nice thing to have. Presumably one would create an
>> > Array subclass of Vector that would add a "dim" attribute. Then Matrix could
>> > extend that to constrain dim to length two (unfortunately colliding with the
>> > Matrix class in the Matrix package). Then RleMatrix extends Matrix to
>> > implement the actual data storage and many of the accelerated methods. As
>> > you said, row-oriented methods would be tough.
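
[Illustrative sketch of the class layering described above, in plain Python rather than R S4 classes. The class names follow the thread, but every constructor, attribute, and method here is an assumption made for the example. It uses single inheritance and 0-based indices, and stores one run-length vector in column-major order with a dim attribute on top.]

```python
from bisect import bisect_right
from itertools import accumulate


class Array:
    """A vector-like object of a given length plus a dim attribute."""

    def __init__(self, length, dim):
        size = 1
        for extent in dim:
            size *= extent
        if size != length:
            raise ValueError("product of dim must equal the vector length")
        self.dim = tuple(dim)


class Matrix(Array):
    """An Array whose dim is constrained to length two."""

    def __init__(self, length, dim):
        if len(dim) != 2:
            raise ValueError("a Matrix needs exactly two dimensions")
        super().__init__(length, dim)


class RleMatrix(Matrix):
    """Matrix whose data is one run-length vector in column-major order."""

    def __init__(self, run_values, run_lengths, dim):
        super().__init__(sum(run_lengths), dim)
        self.run_values = list(run_values)
        # Exclusive end position of each run in the flattened vector.
        self.run_ends = list(accumulate(run_lengths))

    def __getitem__(self, index):
        i, j = index
        nrow, ncol = self.dim
        if not (0 <= i < nrow and 0 <= j < ncol):
            raise IndexError("index out of bounds")
        pos = j * nrow + i  # column-major position, as in an R matrix
        # The first run whose end lies beyond pos is the one covering it.
        return self.run_values[bisect_right(self.run_ends, pos)]


# 3 x 2 matrix whose flattened data is [0, 0, 0, 5, 5, 1].
m = RleMatrix([0, 5, 1], [3, 2, 1], dim=(3, 2))
corner = m[2, 1]
```

Element lookup is a binary search over run ends, so column-wise access is cheap, while walking along a row means one such search per column.
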
>> >
>> > Any takers?
>> >
>> > Michael
>> >
>> > On Mon, Jun 25, 2012 at 9:11 PM, Kasper Daniel Hansen
>> > <kasperdanielhansen at gmail.com> wrote:
>> >>
>> >> On Mon, Jun 25, 2012 at 11:56 PM, Kasper Daniel Hansen
>> >> <kasperdanielhansen at gmail.com> wrote:
>> >> > On Mon, Jun 25, 2012 at 11:36 PM, Michael Lawrence
>> >> > <lawrence.michael at gene.com> wrote:
>> >> >> Patrick and I had talked about this a long time ago (essentially
>> >> >> putting a "dim" attribute on an Rle), but the closest thing today is a
>> >> >> DataFrame with Rle columns.
>> >> >>
>> >> >> Use case?
>> >> >
>> >> > Say I have whole-genome data (for example, coverage) on multiple
>> >> > samples.  Usually, this is far easier to think of as a matrix (in my
>> >> > opinion) with ~3B rows, and I often want to do rowSums(), colSums(), etc.
>> >> > (in fact, probably the whole API from matrixStats).  This is
>> >> > especially nice when you have multiple coverage-like tracks on each
>> >> > sample, so you could have
>> >> >  trackA : genome by samples
>> >> >  trackB : genome by samples
>> >> >  ...
>> >> >
>> >> > You could think of this as a SummarizedExperiment, but with
>> >> > _extremely_ big matrices in the assay slot.
>> >> >
>> >> > I want to take advantage of the Rle structure to store the data more
>> >> > efficiently and also to do potentially faster computations.
>> >> >
>> >> > This is actually closer to my use case where I currently use matrices
>> >> > with ~30M rows (which works fine), but I would like to expand to ~800M
>> >> > rows (which would suck a bit).
>> >> >
>> >> > You could also think of a matrix-like object with Rle columns as an
>> >> > alternative sparse matrix structure.  In a typical sparse matrix you
>> >> > only store the non-zero entries; here we only store the
>> >> > change-points.  Depending on the structure of the matrix, this could be
>> >> > an efficient way to store an otherwise dense matrix.
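
[Illustrative sketch of the change-point storage described above, in plain Python rather than R. rle_encode and rle_sum are hypothetical helpers, not IRanges functions. The column is stored as run values plus run lengths, and its sum is computed directly from the runs.]

```python
from itertools import groupby


def rle_encode(values):
    """Return (run_values, run_lengths) for a sequence of values."""
    run_values, run_lengths = [], []
    for value, group in groupby(values):
        run_values.append(value)
        run_lengths.append(sum(1 for _ in group))
    return run_values, run_lengths


def rle_sum(run_values, run_lengths):
    """Sum of the encoded column, using one multiplication per run."""
    return sum(v * n for v, n in zip(run_values, run_lengths))


# A coverage-like column: 16 rows, only 4 runs.
column = [0] * 5 + [3] * 4 + [7] * 1 + [0] * 6
run_values, run_lengths = rle_encode(column)
col_sum = rle_sum(run_values, run_lengths)
```

Here 16 dense entries collapse to 4 runs, and the zeros cost one run each no matter how long they are. A column that changes value at every row would gain nothing from this encoding.
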
>> >> >
>> >> > So essentially, what I want is to have mathematical operations on
>> >> > this object, where I would exploit the fact that all entries are
>> >> > numbers, so the typical matrix operations make sense.
>> >> >
>> >> > [ side question which could be relevant in this discussion: for a
>> >> > numeric Rle, is there some notion of precision? Say I have truly
>> >> > numeric values with tons of digits, and I want to consider two numbers
>> >> > part of the same run if |x1 - x2| < epsilon. ]
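
[Illustrative sketch for the side question above, in plain Python rather than R. The thread does not say whether Rle supports a tolerance, so rle_encode_tol is purely a hypothetical helper. It assumes each value is compared against the first value of the current run, which keeps every member of a run within epsilon of the stored value. The encoding is lossy: only that first value is kept.]

```python
def rle_encode_tol(values, epsilon):
    """Run-length encode, merging values within epsilon of the run's first value."""
    run_values, run_lengths = [], []
    for x in values:
        if run_values and abs(x - run_values[-1]) < epsilon:
            run_lengths[-1] += 1  # close enough: extend the current run
        else:
            run_values.append(x)  # start a new run, keeping x as its value
            run_lengths.append(1)
    return run_values, run_lengths


noisy = [1.0, 1.0000001, 0.9999999, 2.5, 2.5000002]
tol_values, tol_lengths = rle_encode_tol(noisy, epsilon=1e-6)
exact_values, exact_lengths = rle_encode_tol(noisy, epsilon=0.0)
```

With epsilon set to 0 no two values are ever merged, not even identical ones, because the comparison is strict. Comparing against the previous element instead of the run's first value would let a run drift arbitrarily far, which is why the sketch does not do that.
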
>> >>
>> >> You can see that Pete has had similar thoughts in
>> >> genoset/R/DataFrame-methods.R, although he only has colMeans (which is
>> >> the easy one).
>> >>
>> >> Kasper
>> >>
>> >> > Kasper
>> >> >
>> >> >>
>> >> >> Michael
>> >> >>
>> >> >> On Mon, Jun 25, 2012 at 8:27 PM, Kasper Daniel Hansen
>> >> >> <kasperdanielhansen at gmail.com> wrote:
>> >> >>>
>> >> >>> Do we have a matrix-like object, but where the columns are Rle's?
>> >> >>>
>> >> >>> Kasper
>> >> >>>
>> >> >>> _______________________________________________
>> >> >>> Bioconductor mailing list
>> >> >>> Bioconductor at r-project.org
>> >> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >> >>> Search the archives:
>> >> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >> >>
>> >> >>
>> >
>> >
>>
>
>


