[Bioc-devel] SummarizedExperiment subset of 4 dimensions

Wed Apr 1 16:07:01 CEST 2015

On 04/01/2015 05:08 AM, Michael Lawrence wrote:
> It would be nice if someone from Seattle would weigh in on this.

I was hoping to weigh in with 'it's done', but will instead weigh in with 'it will be done'.

A second aspect of Jesper's data that took me a little by surprise and is 
related to Michael's comment below was that assays() can simultaneously hold 
arrays of 2, 3, (and 4) dimensions.
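
For illustration only, here is a minimal base-R sketch of how row subsetting could be written so that it does not depend on the number of dimensions of each assay. The helper name subset_first_dim is hypothetical and this is not the actual SummarizedExperiment code; it only assumes that the first dimension of every assay corresponds to the rows.

```r
# Hypothetical helper, base R only: subset an array of any dimensionality
# along its first dimension and keep all remaining dimensions intact.
subset_first_dim <- function(x, i) {
    n_extra <- length(dim(x)) - 1L
    # TRUE as an index selects everything along a dimension, so one TRUE
    # per remaining dimension stands in for the empty arguments in
    # x[i, , , drop = FALSE].
    args <- c(list(x, i), rep(list(TRUE), n_extra), list(drop = FALSE))
    do.call(`[`, args)
}

# A plain list standing in for the assays of the example further down.
assays <- list(
    dim2 = array(0, dim = c(3, 3)),
    dim3 = array(0, dim = c(3, 3, 3)),
    dim4 = array(0, dim = c(3, 3, 3, 3))
)

sub <- lapply(assays, subset_first_dim, i = 1)
dims <- lapply(sub, dim)
```

The same call then handles the 2, 3 and 4 dimensional assays held in one list, because the number of index arguments is computed from dim(x) rather than written out by hand.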

Martin

>
> Also, we might want to consider an assayMatrix() accessor that always
> returns an assay in 2D, except, as you suggest, it might be a matrix of
> multiples (vectors, matrices, etc) by putting dimensions on a list. That
> way, generic code can at least assume consistent dimensionality, even if
> the values are complex. I don't really have any use cases though; just
> seems possibly beneficial in the abstract.
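
A rough sketch of what "putting dimensions on a list" could look like in base R. The function name as_assay_matrix is made up for this example, and assayMatrix() itself is only a suggestion in this thread, not an existing accessor. The sketch assumes the first two dimensions of the assay are rows and columns.

```r
# Hypothetical illustration: turn an assay with more than two dimensions
# into a rows x columns list-matrix whose cells hold the "multiples".
as_assay_matrix <- function(x) {
    d <- dim(x)
    if (length(d) == 2L) {
        return(x)  # already 2D, nothing to do
    }
    cells <- vector("list", d[1] * d[2])
    dim(cells) <- d[1:2]  # a list with dimensions behaves like a matrix
    for (i in seq_len(d[1])) {
        for (j in seq_len(d[2])) {
            # Select everything beyond the first two dimensions.
            args <- c(list(x, i, j), rep(list(TRUE), length(d) - 2L))
            # Re-impose the trailing dimensions explicitly.
            cells[[i, j]] <- array(do.call(`[`, args), dim = d[-(1:2)])
        }
    }
    cells
}

x <- array(seq_len(3 * 3 * 2 * 2), dim = c(3, 3, 2, 2))
m <- as_assay_matrix(x)
dim(m)      # rows x columns only
m[[2, 3]]   # the 2 x 2 block of values for row 2, column 3
```

Generic code could then rely on dim(m) always having length 2, at the cost of each cell being a vector or matrix rather than a single value.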
>
> On Wed, Apr 1, 2015 at 1:19 AM, Jesper Gådin <jesper.gadin at gmail.com> wrote:
>
>> Hi Wolfgang and Michael,
>>
>> As Michael says, there is no redundant information in the 4D array I have,
>> and all the values are integers.
>>
>> Of course I can simulate 4D by e.g. creating extra 3D arrays as assays
>> equal to the length of the fourth dimension, but that makes the assay list
>> a mess. It would also require me to write accessor functions that
>> transform the data into 4D before subsequent calculations (or to use a for
>> loop..).
>>
>> Another option would be to include the 4D as a multiple in the 3D, which
>> would not require a later transformation into 4D. If I understood correctly,
>> the array is just a long vector, which is indexed into different
>> dimensions, and so everything in an SE object could as well be written as
>> 2D. But (my belief is that) it is actually convenient to use the properties
>> of dimensions for arrays.
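
To make the "long vector with dimensions" point concrete, a small base-R sketch (the sizes are arbitrary and chosen only for the example): rolling the fourth dimension into the third changes nothing but the dim attribute, and the original 4D shape can be restored afterwards.

```r
# A 4D integer array: 3 rows, 3 columns, and a 3 x 2 block per cell.
x4 <- array(seq_len(3 * 3 * 3 * 2), dim = c(3, 3, 3, 2))

# Roll the fourth dimension into the third. The underlying vector is
# untouched; only the dim attribute changes.
x3 <- x4
dim(x3) <- c(3, 3, 3 * 2)

# R stores arrays in column-major order, so element [i, j, k, l] of the
# 4D array is element [i, j, k + (l - 1) * 3] of the 3D array.
same_element <- x4[2, 1, 3, 2] == x3[2, 1, 3 + (2 - 1) * 3]

# Restoring the dimensions gives back the original array.
x4_again <- x3
dim(x4_again) <- c(3, 3, 3, 2)
round_trip <- identical(x4, x4_again)
```

So the 3D and 4D forms carry the same values; the difference is whether the index arithmetic is done by R or by hand.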
>>
>> So if there is not a problem extending to 4D, I would be extremely
>> grateful if you could take a look at it. :)
>>
>> Regards,
>> Jesper
>>
>> On Tue, Mar 31, 2015 at 2:16 PM, Michael Lawrence <
>> lawrence.michael at gene.com> wrote:
>>
>>> One would need a long-form colData that aligns with the array.
>>>
>>> But now I realize that's not what Jesper wants to do here, and is not how
>>> SE is currently designed. Jesper is using the third (and now fourth)
>>> dimension to store an additional dimension of information about the same
>>> sample. We already support 3D arrays for this, presumably motivated by VCF,
>>> where, for example, each sample can have a probability for WT, het, or hom
>>> at each position. In that case, all of the values are genotype likelihoods,
>>> i.e., they all measure the same thing, so they seem to belong in the same
>>> assay. But they're also the same biological "sample". Essentially, we have
>>> complex measurements that might be a vector, or for Jesper even a matrix.
>>>
>>> The important question for interoperability is whether we want there to
>>> be a contract that assays are always two dimensions. I guess we've already
>>> violated that with VCF. Extending to a fourth is not really hurting
>>> anything.
>>>
>>>
>>> On Tue, Mar 31, 2015 at 4:52 AM, Wolfgang Huber <whuber at embl.de> wrote:
>>>
>>>>
>>>> Hi Michael
>>>>
>>>> where would you put the “colData”-style metadata for the 3rd, 4th, …
>>>> dimensions?
>>>>
>>>> As an (ex-)physicist of course I like arrays, and the more dimensions
>>>> the better, but in practical work I’ve consistently been bitten by the
>>>> rigidity of such a design choice too early in a process.
>>>>
>>>> Wolfgang
>>>>
>>>> On 31 Mar 2015, at 13:32, Michael Lawrence <lawrence.michael at gene.com>
>>>> wrote:
>>>>
>>>> Taken in the abstract, the tidy data argument is one for consistent data
>>>> structures that enable interoperability, which is what we have with
>>>> SummarizedExperiment. The "long form" or "tidy" data frame is an effective
>>>> general representation, but if there is additional structure in your data,
>>>> why not represent it formally? Given the way R lays out the data in arrays,
>>>> it should be possible to add that fourth dimension, in an assay array,
>>>> while still using the colData to annotate that structure. It does not make
>>>> the data any less "tidy", but it does make it more structured.
>>>>
>>>> On Tue, Mar 31, 2015 at 4:14 AM, Wolfgang Huber <whuber at embl.de> wrote:
>>>>
>>>>> Dear Jesper
>>>>>
>>>>> this is maybe not the answer you want to hear, but stuffing in 4, 5, …
>>>>> dimensions may not be all that useful, as you can always roll out these
>>>>> higher dimensions into the existing third (or even into the second, the
>>>>> SummarizedExperiment columns). There is Hadley’s concept of “tidy data”
>>>>> (see e.g. http://www.jstatsoft.org/v59/i10 ) — a paper that is really
>>>>> worthwhile to read — which implies that the tidy way forward is to stay
>>>>> with 2 (or maybe 3) dimensions in SummarizedExperiment, and to record the
>>>>> information that you’d otherwise stuff into the higher dimensions in the
>>>>> colData covariates.
>>>>>
>>>>> Wolfgang
>>>>>
>>>>> Wolfgang Huber
>>>>> Principal Investigator, EMBL Senior Scientist
>>>>> Genome Biology Unit
>>>>> European Molecular Biology Laboratory (EMBL)
>>>>> Heidelberg, Germany
>>>>>
>>>>> T +49-6221-3878823
>>>>> wolfgang.huber at embl.de
>>>>> http://www.huber.embl.de
>>>>>
>>>>>> On 30 Mar 2015, at 12:38, Jesper Gådin <jesper.gadin at gmail.com> wrote:
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> The SummarizedExperiment class is an extremely powerful container for
>>>>>> biological data (thank you!), and all my thinking nowadays is just
>>>>>> circling around how to stuff it as effectively as possible.
>>>>>>
>>>>>> Have been using 3 dimensions for a long time, which has been very
>>>>>> successful. Now I also have a case for using 4 dimensions. Everything
>>>>>> seemed to work as expected until I tried to subset my object, see
>>>>>> example.
>>>>>>
>>>>>> library(GenomicRanges)
>>>>>>
>>>>>> rowRanges <- GRanges(
>>>>>>                 seqnames="chrx",
>>>>>>                 ranges=IRanges(start=1:3,end=4:6),
>>>>>>                 strand="*"
>>>>>>                 )
>>>>>>
>>>>>> coldata <- DataFrame(row.names=paste("s",1:3, sep=""))
>>>>>>
>>>>>> assays <- SimpleList()
>>>>>>
>>>>>> #two dim
>>>>>> assays[["dim2"]] <- array(0,dim=c(3,3))
>>>>>> se <- SummarizedExperiment(assays, rowRanges = rowRanges,
>>>>>>                            colData=coldata)
>>>>>> se[1]
>>>>>> #works
>>>>>>
>>>>>> #three dim
>>>>>> assays[["dim3"]] <- array(0,dim=c(3,3,3))
>>>>>> se <- SummarizedExperiment(assays, rowRanges = rowRanges,
>>>>>>                            colData=coldata)
>>>>>> se[1]
>>>>>> #works
>>>>>>
>>>>>> #four dim
>>>>>> assays[["dim4"]] <- array(0,dim=c(3,3,3,3))
>>>>>> se <- SummarizedExperiment(assays, rowRanges = rowRanges,
>>>>>>                            colData=coldata)
>>>>>> se[1]
>>>>>> #does not work
>>>>>> #Error in x[i, , , drop = FALSE] : incorrect number of dimensions
>>>>>>
>>>>>> This is also the case for rbind and cbind. Would it be appropriate to
>>>>>> ask you to update the SE functions to handle subset, rbind, cbind also
>>>>>> for 4 dimensions? I know the time for next release is very soon, so maybe
>>>>>> it is better to wait until after April 16. Just let me know your thoughts
>>>>>> about it.
>>>>>>
>>>>>> Jesper
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793