[Bioc-devel] SummarizedExperiment subset of 4 dimensions

Wed Apr 1 14:08:25 CEST 2015

It would be nice if someone from Seattle would weigh in on this.

Also, we might want to consider an assayMatrix() accessor that always
returns an assay in 2D, except, as you suggest, it might be a matrix of
multiples (vectors, matrices, etc) by putting dimensions on a list. That
way, generic code can at least assume consistent dimensionality, even if
the values are complex. I don't really have any use cases though; just
seems possibly beneficial in the abstract.
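
To make the idea concrete, here is a rough base-R sketch. Note that
assayMatrix() does not exist in any package; the name and the behavior
below are only my assumption of what such an accessor might do. It takes
an array with two or more dimensions and returns a features x samples
matrix whose cells are lists holding the remaining dimensions.

```r
# Hypothetical helper (NOT part of SummarizedExperiment): collapse an
# n-dimensional assay into a 2D list-matrix, one cell per feature/sample.
assayMatrix <- function(a) {
    d <- dim(a)
    stopifnot(length(d) >= 2L)
    if (length(d) == 2L)
        return(a)                        # already 2D, nothing to do
    cells <- vector("list", d[1L] * d[2L])
    k <- 0L
    for (j in seq_len(d[2L])) {          # fill in column-major order
        for (i in seq_len(d[1L])) {
            k <- k + 1L
            # select a[i, j, , ...] with one TRUE per trailing dimension
            idx <- c(list(a, i, j), rep(list(TRUE), length(d) - 2L))
            cell <- do.call(`[`, c(idx, list(drop = FALSE)))
            dim(cell) <- d[-(1:2)]       # keep only the trailing dimensions
            cells[[k]] <- cell
        }
    }
    dim(cells) <- d[1:2]                 # putting dimensions on a list
    dimnames(cells) <- dimnames(a)[1:2]
    cells
}

a4 <- array(seq_len(3 * 3 * 2 * 4), dim = c(3, 3, 2, 4))
m <- assayMatrix(a4)
dim(m)            # 3 3
dim(m[[2, 3]])    # 2 4
```

Generic code could then rely on dim(m) always having length two, and
only code that cares about the inner structure would look inside the
cells.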

On Wed, Apr 1, 2015 at 1:19 AM, Jesper Gådin <jesper.gadin at gmail.com> wrote:

> Hi Wolfgang and Michael,
>
> As Michael says, there is no redundant information in the 4D array I have,
> and all the values are integers.
>
> Of course I can simulate 4D by e.g. creating extra 3D arrays as assays
> equal to the length of the fourth dimension, but that makes the assay list
> a mess. It would also require me to write accessor functions that
> transform the data into 4D before subsequent calculations (or to use a for
> loop..).
>
> Another option would be to include the 4D as a multiple in the 3D, which
> would not require a later transformation into 4D. If I understood correctly,
> the array is just a long vector, which is indexed into different
> dimensions, and so everything in an SE object could as well be written as
> 2D. But (my belief is that) it is actually convenient to use the properties
> of dimensions for arrays.
>
> So if there is not a problem extending to 4D, I would be extremely
> grateful if you could take a look at it. :)
>
> Regards,
> Jesper
>
> On Tue, Mar 31, 2015 at 2:16 PM, Michael Lawrence <
> lawrence.michael at gene.com> wrote:
>
>> One would need a long-form colData that aligns with the array.
>>
>> But now I realize that's not what Jesper wants to do here, and is not how
>> SE is currently designed. Jesper is using the third (and now fourth)
>> dimension to store an additional dimension of information about the same
>> sample. We already support 3D arrays for this, presumably motivated by VCF,
>> where, for example, each sample can have a probability for WT, het, or hom
>> at each position. In that case, all of the values are genotype likelihoods,
>> i.e., they all measure the same thing, so they seem to belong in the same
>> assay. But they're also the same biological "sample". Essentially, we have
>> complex measurements that might be a vector, or for Jesper even a matrix.
>>
>> The important question for interoperability is whether we want there to
>> be a contract that assays are always two dimensions. I guess we've already
>> violated that with VCF. Extending to a fourth is not really hurting
>> anything.
>>
>>
>> On Tue, Mar 31, 2015 at 4:52 AM, Wolfgang Huber <whuber at embl.de> wrote:
>>
>>>
>>> Hi Michael
>>>
>>> where would you put the “colData”-style metadata for the 3rd, 4th, …
>>> dimensions?
>>>
>>> As an (ex-)physicist of course I like arrays, and the more dimensions
>>> the better, but in practical work I’ve consistently been bitten by the
>>> rigidity of such a design choice too early in a process.
>>>
>>> Wolfgang
>>>
>>> On 31 Mar 2015, at 13:32, Michael Lawrence <lawrence.michael at gene.com>
>>> wrote:
>>>
>>> Taken in the abstract, the tidy data argument is one for consistent data
>>> structures that enable interoperability, which is what we have with
>>> SummarizedExperiment. The "long form" or "tidy" data frame is an effective
>>> general representation, but if there is additional structure in your data,
>>> why not represent it formally? Given the way R lays out the data in arrays,
>>> it should be possible to add that fourth dimension, in an assay array,
>>> while still using the colData to annotate that structure. It does not make
>>> the data any less "tidy", but it does make it more structured.
>>>
>>> On Tue, Mar 31, 2015 at 4:14 AM, Wolfgang Huber <whuber at embl.de> wrote:
>>>
>>>> Dear Jesper
>>>>
>>>> this is maybe not the answer you want to hear, but stuffing in 4, 5, …
>>>> dimensions may not be all that useful, as you can always roll out these
>>>> higher dimensions into the existing third (or even into the second, the
>>>> SummarizedExperiment columns). There is Hadley’s concept of “tidy data”
>>>> (see e.g. http://www.jstatsoft.org/v59/i10 ) — a paper that is really
>>>> worthwhile to read — which implies that the tidy way forward is to stay
>>>> with 2 (or maybe 3) dimensions in SummarizedExperiment, and to record the
>>>> information that you’d otherwise stuff into the higher dimensions in the
>>>> colData covariates.
>>>>
>>>> Wolfgang
>>>>
>>>> Wolfgang Huber
>>>> Principal Investigator, EMBL Senior Scientist
>>>> Genome Biology Unit
>>>> European Molecular Biology Laboratory (EMBL)
>>>> Heidelberg, Germany
>>>>
>>>> T +49-6221-3878823
>>>> wolfgang.huber at embl.de
>>>> http://www.huber.embl.de
>>>>
>>>> > On 30 Mar 2015, at 12:38, Jesper Gådin <jesper.gadin at gmail.com> wrote:
>>>> >
>>>> > Hi!
>>>> >
>>>> > The SummarizedExperiment class is an extremely powerful container for
>>>> > biological data (thank you!), and all my thinking nowadays is just
>>>> > circling around how to stuff it as effectively as possible.
>>>> >
>>>> > I have been using 3 dimensions for a long time, which has been very
>>>> > successful. Now I also have a case for using 4 dimensions. Everything
>>>> > seemed to work as expected until I tried to subset my object; see the
>>>> > example.
>>>> >
>>>> > library(GenomicRanges)
>>>> >
>>>> > rowRanges <- GRanges(
>>>> >                seqnames="chrx",
>>>> >                ranges=IRanges(start=1:3,end=4:6),
>>>> >                strand="*"
>>>> >                )
>>>> >
>>>> > coldata <- DataFrame(row.names=paste("s",1:3, sep=""))
>>>> >
>>>> > assays <- SimpleList()
>>>> >
>>>> > # two dimensions
>>>> > assays[["dim2"]] <- array(0, dim=c(3,3))
>>>> > se <- SummarizedExperiment(assays, rowRanges=rowRanges,
>>>> >                            colData=coldata)
>>>> > se[1]
>>>> > # works
>>>> >
>>>> > # three dimensions
>>>> > assays[["dim3"]] <- array(0, dim=c(3,3,3))
>>>> > se <- SummarizedExperiment(assays, rowRanges=rowRanges,
>>>> >                            colData=coldata)
>>>> > se[1]
>>>> > # works
>>>> >
>>>> > # four dimensions
>>>> > assays[["dim4"]] <- array(0, dim=c(3,3,3,3))
>>>> > se <- SummarizedExperiment(assays, rowRanges=rowRanges,
>>>> >                            colData=coldata)
>>>> > se[1]
>>>> > # does not work
>>>> > # Error in x[i, , , drop = FALSE] : incorrect number of dimensions
>>>> >
>>>> > This is also the case for rbind and cbind. Would it be appropriate to
>>>> > ask you to update the SE functions to handle subset, rbind, and cbind
>>>> > for 4 dimensions as well? I know the next release is very soon, so
>>>> > maybe it is better to wait until after April 16. Just let me know your
>>>> > thoughts about it.
>>>> >
>>>> > Jesper
>>>> >
>>>> >       [[alternative HTML version deleted]]
>>>> >
>>>> > _______________________________________________
>>>> > Bioc-devel at r-project.org mailing list
>>>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>
>>>
>>>
>>
>
