[Bioc-devel] SummarizedExperiment subset of 4 dimensions

Martin Morgan mtmorgan at fredhutch.org
Wed Apr 1 21:54:43 CEST 2015


On 04/01/2015 07:07 AM, Martin Morgan wrote:
> On 04/01/2015 05:08 AM, Michael Lawrence wrote:
>> It would be nice if someone from Seattle would weigh in on this.
>
> I was hoping to weigh in with 'it's done' but will instead say 'it will be done'.

4-dimensional assays, advisable or otherwise, are available in GenomicRanges 
1.19.49. Thanks for your patience, and for the discussion. Martin
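
For anyone reading the archive: the error in Jesper's example comes from an
index expression with a fixed number of commas (x[i, , , drop = FALSE]), which
only fits a 3-D array. Below is a minimal base-R sketch of subsetting the first
dimension of an array with any number of dimensions. It is an illustration
only, not the code in GenomicRanges, and subset_rows is a hypothetical helper
name.

```r
# Hypothetical helper (not part of GenomicRanges): subset the first
# dimension of an array with any number of dimensions, keeping all dims.
subset_rows <- function(x, i) {
    n <- length(dim(x))
    # Build the call x[i, TRUE, ..., TRUE, drop = FALSE]; TRUE selects
    # everything along each of the remaining n - 1 dimensions.
    args <- c(list(x, i), rep(list(TRUE), n - 1L), list(drop = FALSE))
    do.call(`[`, args)
}

a2 <- array(0, dim = c(3, 3))
a4 <- array(0, dim = c(3, 3, 3, 3))
dim(subset_rows(a2, 1))    # 1 3
dim(subset_rows(a4, 1))    # 1 3 3 3
```

The same helper covers the 2-, 3- and 4-dimensional assays in the example,
because the number of index slots is computed from dim(x) rather than written
out by hand.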

>
> A second aspect of Jesper's data that took me a little by surprise and is
> related to Michael's comment below was that assays() can simultaneously hold
> arrays of 2, 3, (and 4) dimensions.
>
> Martin
>
>>
>> Also, we might want to consider an assayMatrix() accessor that always
>> returns an assay in 2D, except, as you suggest, it might be a matrix of
>> multiples (vectors, matrices, etc) by putting dimensions on a list. That
>> way, generic code can at least assume consistent dimensionality, even if
>> the values are complex. I don't really have any use cases though; just
>> seems possibly beneficial in the abstract.
>>
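
To make the "matrix of multiples" idea concrete, here is a rough base-R sketch
of putting dimensions on a list so that each cell of a 2-D matrix holds
everything measured for one row/sample pair. No assayMatrix() accessor exists
as far as this thread goes; as_list_matrix is a hypothetical name used only for
illustration.

```r
# Hypothetical sketch: turn an array with >= 3 dimensions into a
# rows x samples list-matrix whose cells hold the trailing dimensions.
as_list_matrix <- function(x) {
    d <- dim(x)
    stopifnot(length(d) >= 3L)
    rest <- d[-(1:2)]
    # Arrays are stored column-major, so reshaping to
    # (rows * samples) x prod(rest) puts one cell's values in one row.
    m <- x
    dim(m) <- c(d[1] * d[2], prod(rest))
    cells <- lapply(seq_len(nrow(m)), function(k) array(m[k, ], dim = rest))
    dim(cells) <- d[1:2]    # dimensions on a list give a 2-D list-matrix
    cells
}

a4 <- array(seq_len(2 * 3 * 4 * 5), dim = c(2, 3, 4, 5))
lm <- as_list_matrix(a4)
dim(lm)            # 2 3
dim(lm[[2, 3]])    # 4 5
```

Generic code can then rely on dim(lm) always having length 2, while each cell
keeps its own structure.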
>> On Wed, Apr 1, 2015 at 1:19 AM, Jesper Gådin <jesper.gadin at gmail.com> wrote:
>>
>>> Hi Wolfgang and Michael,
>>>
>>> As Michael says, there is no redundant information in the 4D array I have,
>>> and all the values are integers.
>>>
>>> Of course I can simulate 4D by e.g. creating extra 3D arrays as assays
>>> equal to the length of the fourth dimension, but that makes the assay list
>>> a mess. It would also require me to write accessor functions that
>>> transform the data into 4D before subsequent calculations (or to use a for
>>> loop...).
>>>
>>> Another option would be to include the 4D as a multiple in the 3D, which
>>> would not require a later transformation into 4D. If I understood correctly,
>>> the array is just a long vector, which is indexed into different
>>> dimensions, and so everything in an SE object could as well be written as
>>> 2D. But (my belief is that) it is actually convenient to use the properties
>>> of dimensions for arrays.
>>>
>>> So if there is not a problem extending to 4D, I would be extremely
>>> grateful if you could take a look at it. :)
>>>
>>> Regards,
>>> Jesper
>>>
>>> On Tue, Mar 31, 2015 at 2:16 PM, Michael Lawrence <
>>> lawrence.michael at gene.com> wrote:
>>>
>>>> One would need a long-form colData that aligns with the array.
>>>>
>>>> But now I realize that's not what Jesper wants to do here, and is not how
>>>> SE is currently designed. Jesper is using the third (and now fourth)
>>>> dimension to store an additional dimension of information about the same
>>>> sample. We already support 3D arrays for this, presumably motivated by VCF,
>>>> where, for example, each sample can have a probability for WT, het, or hom
>>>> at each position. In that case, all of the values are genotype likelihoods,
>>>> i.e., they all measure the same thing, so they seem to belong in the same
>>>> assay. But they're also the same biological "sample". Essentially, we have
>>>> complex measurements that might be a vector, or for Jesper even a matrix.
>>>>
>>>> The important question for interoperability is whether we want there to
>>>> be a contract that assays are always two dimensions. I guess we've already
>>>> violated that with VCF. Extending to a fourth is not really hurting
>>>> anything.
>>>>
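
For comparison, the alternative discussed in this part of the thread (rolling
the higher dimensions out into the columns and describing them in a long-form
colData) can be sketched in base R as below. The dimension names and the
long_coldata table are made up for illustration; this is not how
SummarizedExperiment stores anything.

```r
# Made-up 4-D example: 3 rows x 2 samples x 2 levels of d3 x 2 levels of d4.
a4 <- array(seq_len(3 * 2 * 2 * 2), dim = c(3, 2, 2, 2),
            dimnames = list(NULL, c("s1", "s2"), c("x1", "x2"), c("y1", "y2")))

# Roll dimensions 2..4 out into columns; assigning dim drops the dimnames.
flat <- a4
dim(flat) <- c(3, 2 * 2 * 2)

# Long-form column annotation. expand.grid varies its first argument
# fastest, which matches R's column-major layout of the array.
long_coldata <- expand.grid(sample = c("s1", "s2"),
                            d3 = c("x1", "x2"),
                            d4 = c("y1", "y2"),
                            stringsAsFactors = FALSE)

# Column k of flat is described by row k of long_coldata.
k <- which(long_coldata$sample == "s2" &
           long_coldata$d3 == "x1" &
           long_coldata$d4 == "y2")
all(flat[, k] == a4[, "s2", "x1", "y2"])
```

The flat form keeps every assay two-dimensional at the cost of repeating the
sample annotation once per combination of the higher dimensions.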
>>>>
>>>> On Tue, Mar 31, 2015 at 4:52 AM, Wolfgang Huber <whuber at embl.de> wrote:
>>>>
>>>>>
>>>>> Hi Michael
>>>>>
>>>>> where would you put the “colData”-style metadata for the 3rd, 4th, …
>>>>> dimensions?
>>>>>
>>>>> As an (ex-)physicist of course I like arrays, and the more dimensions
>>>>> the better, but in practical work I’ve consistently been bitten by the
>>>>> rigidity of such a design choice too early in a process.
>>>>>
>>>>> Wolfgang
>>>>>
>>>>> On 31 Mar 2015, at 13:32, Michael Lawrence <lawrence.michael at gene.com>
>>>>> wrote:
>>>>>
>>>>> Taken in the abstract, the tidy data argument is one for consistent data
>>>>> structures that enable interoperability, which is what we have with
>>>>> SummarizedExperiment. The "long form" or "tidy" data frame is an effective
>>>>> general representation, but if there is additional structure in your data,
>>>>> why not represent it formally? Given the way R lays out the data in arrays,
>>>>> it should be possible to add that fourth dimension, in an assay array,
>>>>> while still using the colData to annotate that structure. It does not make
>>>>> the data any less "tidy", but it does make it more structured.
>>>>>
>>>>> On Tue, Mar 31, 2015 at 4:14 AM, Wolfgang Huber <whuber at embl.de> wrote:
>>>>>
>>>>>> Dear Jesper
>>>>>>
>>>>>> this is maybe not the answer you want to hear, but stuffing in 4, 5, …
>>>>>> dimensions may not be all that useful, as you can always roll out these
>>>>>> higher dimensions into the existing third (or even into the second, the
>>>>>> SummarizedExperiment columns). There is Hadley’s concept of “tidy data”
>>>>>> (see e.g. http://www.jstatsoft.org/v59/i10 ) — a paper that is really
>>>>>> worthwhile to read — which implies that the tidy way forward is to stay
>>>>>> with 2 (or maybe 3) dimensions in SummarizedExperiment, and to record the
>>>>>> information that you’d otherwise stuff into the higher dimensions in the
>>>>>> colData covariates.
>>>>>>
>>>>>> Wolfgang
>>>>>>
>>>>>> Wolfgang Huber
>>>>>> Principal Investigator, EMBL Senior Scientist
>>>>>> Genome Biology Unit
>>>>>> European Molecular Biology Laboratory (EMBL)
>>>>>> Heidelberg, Germany
>>>>>>
>>>>>> T +49-6221-3878823
>>>>>> wolfgang.huber at embl.de
>>>>>> http://www.huber.embl.de
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On 30 Mar 2015, at 12:38, Jesper Gådin <jesper.gadin at gmail.com> wrote:
>>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> The SummarizedExperiment class is an extremely powerful container for
>>>>>>> biological data (thank you!), and all my thinking nowadays is just
>>>>>>> circling around how to stuff it as effectively as possible.
>>>>>>>
>>>>>>> I have been using 3 dimensions for a long time, which has been very
>>>>>>> successful. Now I also have a case for using 4 dimensions. Everything
>>>>>>> seemed to work as expected until I tried to subset my object; see the
>>>>>>> example.
>>>>>>>
>>>>>>> library(GenomicRanges)
>>>>>>>
>>>>>>> rowRanges <- GRanges(
>>>>>>>                 seqnames="chrx",
>>>>>>>                 ranges=IRanges(start=1:3,end=4:6),
>>>>>>>                 strand="*"
>>>>>>>                 )
>>>>>>>
>>>>>>> coldata <- DataFrame(row.names=paste("s",1:3, sep=""))
>>>>>>>
>>>>>>> assays <- SimpleList()
>>>>>>>
>>>>>>> #two dim
>>>>>>> assays[["dim2"]] <- array(0,dim=c(3,3))
>>>>>>> se <- SummarizedExperiment(assays, rowRanges = rowRanges,
>>>>>>>                            colData=coldata)
>>>>>>> se[1]
>>>>>>> #works
>>>>>>>
>>>>>>> #three dim
>>>>>>> assays[["dim3"]] <- array(0,dim=c(3,3,3))
>>>>>>> se <- SummarizedExperiment(assays, rowRanges = rowRanges,
>>>>>>>                            colData=coldata)
>>>>>>> se[1]
>>>>>>> #works
>>>>>>>
>>>>>>> #four dim
>>>>>>> assays[["dim4"]] <- array(0,dim=c(3,3,3,3))
>>>>>>> se <- SummarizedExperiment(assays, rowRanges = rowRanges,
>>>>>>>                            colData=coldata)
>>>>>>> se[1]
>>>>>>> #does not work
>>>>>>> #Error in x[i, , , drop = FALSE] : incorrect number of dimensions
>>>>>>>
>>>>>>> This is also the case for rbind and cbind. Would it be appropriate to ask
>>>>>>> you to update the SE functions to handle subset, rbind, and cbind for 4
>>>>>>> dimensions as well? I know the next release is very soon, so maybe it is
>>>>>>> better to wait until after April 16. Just let me know your thoughts about
>>>>>>> it.
>>>>>>>
>>>>>>> Jesper
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


