[Bioc-devel] SummarizedExperiment subset of 4 dimensions

Mon Apr 6 21:10:51 CEST 2015

Thank you all very much!

Jesper

On Wed, Apr 1, 2015 at 9:54 PM, Martin Morgan <mtmorgan at fredhutch.org> wrote:

> On 04/01/2015 07:07 AM, Martin Morgan wrote:
>
>> On 04/01/2015 05:08 AM, Michael Lawrence wrote:
>>
>>> It would be nice if someone from Seattle would weigh in on this.
>>>
>>
>> I was hoping to weigh in with 'it's done' but will instead weigh in with
>> 'it will be done'.
>>
>
> 4-dimensional assays, advisable or otherwise, are available in
> GenomicRanges 1.19.49. Thanks for your patience, and for the discussion.
> Martin
>
>
>
>> A second aspect of Jesper's data that took me a little by surprise, and is
>> related to Michael's comment below, was that assays() can simultaneously
>> hold arrays of 2, 3, (and 4) dimensions.
>>
>> Martin
>>
>>
>>> Also, we might want to consider an assayMatrix() accessor that always
>>> returns an assay in 2D, except, as you suggest, it might be a matrix of
>>> multiples (vectors, matrices, etc) by putting dimensions on a list. That
>>> way, generic code can at least assume consistent dimensionality, even if
>>> the values are complex. I don't really have any use cases though; just
>>> seems possibly beneficial in the abstract.
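A minimal base-R sketch of the "dimensions on a list" idea described above, assuming a small features x samples x values array. The names `arr` and `cells` are made up for illustration, and assayMatrix() is only a proposal in this thread, not an existing accessor.

```r
# Assumed toy data: 3 features x 2 samples, with 4 values per cell.
arr <- array(seq_len(3 * 2 * 4), dim = c(3, 2, 4))

# One list element per (feature, sample) cell, in column-major order
# (expand.grid varies its first column fastest, like R's array layout).
idx <- expand.grid(i = seq_len(nrow(arr)), j = seq_len(ncol(arr)))
cells <- Map(function(i, j) arr[i, j, ], idx$i, idx$j)

# Putting dim on the list turns it into a 2-D "matrix of multiples".
dim(cells) <- dim(arr)[1:2]

dim(cells)                      # always two dimensions
cells[[2, 1]]                   # the 4 values for feature 2, sample 1
dim(cells[1, , drop = FALSE])   # ordinary matrix-style subsetting still works
```

Generic code could then rely on two dimensions, while each cell holds whatever structure the assay needs.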
>>>
>>> On Wed, Apr 1, 2015 at 1:19 AM, Jesper Gådin <jesper.gadin at gmail.com> wrote:
>>>
>>>> Hi Wolfgang and Michael,
>>>>
>>>> As Michael says, there is no redundant information in the 4D array I
>>>> have, and all the values are integers.
>>>>
>>>> Of course I can simulate 4D by, e.g., creating one extra 3D array as an
>>>> assay for each element of the fourth dimension, but that makes the assay
>>>> list a mess. It would also require me to write accessor functions that
>>>> transform the data into 4D before subsequent calculations (or to use a
>>>> for loop).
>>>>
>>>> Another option would be to include the fourth dimension as a multiple in
>>>> the third, which would not require a later transformation into 4D. If I
>>>> understood correctly, an array is just a long vector that is indexed by
>>>> its dimensions, so everything in an SE object could just as well be
>>>> written as 2D. But (my belief is that) it is actually convenient to use
>>>> the properties of dimensions for arrays.
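The remark that an array is a long vector with a dim attribute can be checked in base R. A small sketch, with made-up object names `a4` and `a3`:

```r
# A 4-D array: the data are one vector of 36 values plus a dim attribute.
a4 <- array(seq_len(3 * 3 * 2 * 2), dim = c(3, 3, 2, 2))

# Folding the 4th dimension into the 3rd only changes the dim attribute;
# the order of the underlying values stays the same.
a3 <- a4
dim(a3) <- c(3, 3, 4)

# Element [i, j, k, l] of a4 is element [i, j, k + 2 * (l - 1)] of a3.
a4[2, 3, 1, 2] == a3[2, 3, 3]

# Restoring the original dim gives back the 4-D array.
dim(a3) <- c(3, 3, 2, 2)
identical(a3, a4)
```

So the 3D and 4D forms carry the same values; the difference is how conveniently they can be indexed.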
>>>>
>>>> So if there is not a problem extending to 4D, I would be extremely
>>>> grateful if you could take a look at it. :)
>>>>
>>>> Regards,
>>>> Jesper
>>>>
>>>> On Tue, Mar 31, 2015 at 2:16 PM, Michael Lawrence
>>>> <lawrence.michael at gene.com> wrote:
>>>>
>>>>> One would need a long-form colData that aligns with the array.
>>>>>
>>>>> But now I realize that's not what Jesper wants to do here, and it is not
>>>>> how SE is currently designed. Jesper is using the third (and now fourth)
>>>>> dimension to store an additional dimension of information about the same
>>>>> sample. We already support 3D arrays for this, presumably motivated by
>>>>> VCF, where, for example, each sample can have a probability for WT, het,
>>>>> or hom at each position. In that case, all of the values are genotype
>>>>> likelihoods, i.e., they all measure the same thing, so they seem to
>>>>> belong in the same assay. But they're also the same biological "sample".
>>>>> Essentially, we have complex measurements that might be a vector, or for
>>>>> Jesper even a matrix.
>>>>>
>>>>> The important question for interoperability is whether we want there to
>>>>> be a contract that assays are always two dimensions. I guess we've
>>>>> already violated that with VCF. Extending to a fourth is not really
>>>>> hurting anything.
>>>>>
>>>>>
>>>>> On Tue, Mar 31, 2015 at 4:52 AM, Wolfgang Huber <whuber at embl.de> wrote:
>>>>>
>>>>>> Hi Michael
>>>>>>
>>>>>> where would you put the “colData”-style metadata for the 3rd, 4th, …
>>>>>> dimensions?
>>>>>>
>>>>>> As an (ex-)physicist, of course I like arrays, and the more dimensions
>>>>>> the better, but in practical work I’ve consistently been bitten by the
>>>>>> rigidity of such a design choice made too early in a process.
>>>>>>
>>>>>> Wolfgang
>>>>>>
>>>>>> On 31 Mar 2015, at 13:32, Michael Lawrence <lawrence.michael at gene.com>
>>>>>> wrote:
>>>>>>
>>>>>> Taken in the abstract, the tidy data argument is one for consistent data
>>>>>> structures that enable interoperability, which is what we have with
>>>>>> SummarizedExperiment. The "long form" or "tidy" data frame is an
>>>>>> effective general representation, but if there is additional structure
>>>>>> in your data, why not represent it formally? Given the way R lays out
>>>>>> the data in arrays, it should be possible to add that fourth dimension,
>>>>>> in an assay array, while still using the colData to annotate that
>>>>>> structure. It does not make the data any less "tidy", but it does make
>>>>>> it more structured.
>>>>>>
>>>>>> On Tue, Mar 31, 2015 at 4:14 AM, Wolfgang Huber <whuber at embl.de> wrote:
>>>>>>
>>>>>>> Dear Jesper
>>>>>>>
>>>>>>> this is maybe not the answer you want to hear, but stuffing in 4, 5, …
>>>>>>> dimensions may not be all that useful, as you can always roll out these
>>>>>>> higher dimensions into the existing third (or even into the second, the
>>>>>>> SummarizedExperiment columns). There is Hadley’s concept of “tidy data”
>>>>>>> (see e.g. http://www.jstatsoft.org/v59/i10 ), a paper that is really
>>>>>>> worthwhile to read, which implies that the tidy way forward is to stay
>>>>>>> with 2 (or maybe 3) dimensions in SummarizedExperiment, and to record
>>>>>>> the information that you’d otherwise stuff into the higher dimensions
>>>>>>> in the colData covariates.
>>>>>>>
>>>>>>> Wolfgang
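A hedged base-R sketch of the "roll out" approach described above: the higher dimensions are unrolled into columns, and a long-form, colData-like table records what each column means. The object names (`a4`, `m`, `cols`) and the dimension labels are assumptions for illustration only.

```r
# Assumed toy data: 3 features x 2 samples x 2 levels x 2 levels.
a4 <- array(seq_len(3 * 2 * 2 * 2), dim = c(3, 2, 2, 2),
            dimnames = list(NULL, c("s1", "s2"), c("a", "b"), c("x", "y")))

# Unroll dimensions 2..4 into the columns of an ordinary matrix.
# Assigning dim drops the dimnames, so the labels go into 'cols' instead.
m <- a4
dim(m) <- c(3, 2 * 2 * 2)

# Long-form column annotation with one row per column of m. expand.grid
# varies its first column fastest, which matches R's column-major layout.
cols <- expand.grid(sample = dimnames(a4)[[2]],
                    d3 = dimnames(a4)[[3]],
                    d4 = dimnames(a4)[[4]],
                    stringsAsFactors = FALSE)

# Look up the column for sample "s2" at d3 = "a", d4 = "y".
j <- which(cols$sample == "s2" & cols$d3 == "a" & cols$d4 == "y")
identical(m[, j], a4[, "s2", "a", "y"])
```

The cost is that the annotation table must be kept aligned with the unrolled columns by hand, which is the trade-off debated in this thread.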
>>>>>>>
>>>>>>> Wolfgang Huber
>>>>>>> Principal Investigator, EMBL Senior Scientist
>>>>>>> Genome Biology Unit
>>>>>>> European Molecular Biology Laboratory (EMBL)
>>>>>>> Heidelberg, Germany
>>>>>>>
>>>>>>> T +49-6221-3878823
>>>>>>> wolfgang.huber at embl.de
>>>>>>> http://www.huber.embl.de
>>>>>>>
>>>>>>> On 30 Mar 2015, at 12:38, Jesper Gådin <jesper.gadin at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> The SummarizedExperiment class is an extremely powerful container for
>>>>>>>> biological data (thank you!), and all my thinking nowadays just circles
>>>>>>>> around how to stuff it as effectively as possible.
>>>>>>>>
>>>>>>>> I have been using 3 dimensions for a long time, which has been very
>>>>>>>> successful. Now I also have a case for using 4 dimensions. Everything
>>>>>>>> seemed to work as expected until I tried to subset my object; see the
>>>>>>>> example.
>>>>>>>>
>>>>>>>> library(GenomicRanges)
>>>>>>>>
>>>>>>>> rowRanges <- GRanges(
>>>>>>>>                 seqnames="chrx",
>>>>>>>>                 ranges=IRanges(start=1:3,end=4:6),
>>>>>>>>                 strand="*"
>>>>>>>>                 )
>>>>>>>>
>>>>>>>> coldata <- DataFrame(row.names=paste("s",1:3, sep=""))
>>>>>>>>
>>>>>>>> assays <- SimpleList()
>>>>>>>>
>>>>>>>> #two dim
>>>>>>>> assays[["dim2"]] <- array(0,dim=c(3,3))
>>>>>>>> se <- SummarizedExperiment(assays, rowRanges = rowRanges,
>>>>>>>>                            colData=coldata)
>>>>>>>> se[1]
>>>>>>>> #works
>>>>>>>>
>>>>>>>> #three dim
>>>>>>>> assays[["dim3"]] <- array(0,dim=c(3,3,3))
>>>>>>>> se <- SummarizedExperiment(assays, rowRanges = rowRanges,
>>>>>>>>                            colData=coldata)
>>>>>>>> se[1]
>>>>>>>> #works
>>>>>>>>
>>>>>>>> #four dim
>>>>>>>> assays[["dim4"]] <- array(0,dim=c(3,3,3,3))
>>>>>>>> se <- SummarizedExperiment(assays, rowRanges = rowRanges,
>>>>>>>>                            colData=coldata)
>>>>>>>> se[1]
>>>>>>>> #does not work
>>>>>>>> #Error in x[i, , , drop = FALSE] : incorrect number of dimensions
>>>>>>>>
>>>>>>>> This is also the case for rbind and cbind. Would it be appropriate to
>>>>>>>> ask you to update the SE functions to handle subset, rbind, and cbind
>>>>>>>> for 4 dimensions as well? I know the next release is very soon, so
>>>>>>>> maybe it is better to wait until after April 16. Just let me know your
>>>>>>>> thoughts about it.
>>>>>>>>
>>>>>>>> Jesper
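The error reported above can be reproduced with a plain array: `x[i, , , drop = FALSE]` supplies three index slots, which only fits a 3-D array. The sketch below shows one possible dimension-agnostic way to subset along the first dimension in base R. The helper name `subset_first_dim` is hypothetical, and this is not necessarily how the fix was implemented in GenomicRanges.

```r
# A plain 4-D array shaped like the "dim4" assay in the example.
a4 <- array(0, dim = c(3, 3, 3, 3))

# Three index slots on a 4-D array fail, as in the reported error.
msg <- tryCatch(a4[1, , , drop = FALSE],
                error = function(e) conditionMessage(e))
print(msg)

# Hypothetical helper: subset the first dimension of an array with any
# number of dimensions, keeping all dimensions (drop = FALSE).
subset_first_dim <- function(x, i) {
  nd <- length(dim(x))
  # TRUE selects everything along each of the remaining nd - 1 dimensions.
  args <- c(list(x, i), rep(list(TRUE), nd - 1L), list(drop = FALSE))
  do.call(`[`, args)
}

dim(subset_first_dim(a4, 1))               # 4-D input
dim(subset_first_dim(a4[, , , 1], 1:2))    # the same helper on a 3-D array
```

Building the index list from `length(dim(x))` avoids hard-coding the number of commas, so 2-D, 3-D and 4-D assays go through the same code path.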
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
