[Bioc-devel] SummarizedExperiments

Martin Morgan mtmorgan at fhcrc.org
Sun Sep 2 23:47:56 CEST 2012


On 08/30/2012 08:47 AM, Vincent Carey wrote:
> I am all in favor of the dialogue and the aim.  I reproduced your first
> timing, and note
>
>> unix.time(ss2 at colData$tx <- 1:1000)
>     user  system elapsed
>    5.963   4.520  10.483
>> unix.time(colData(ss2)$tx2 <- 1:1000)
>     user  system elapsed
>   17.937  13.074  31.016
>
> BTW did you really mean to put the meth/unmeth in exptData, or should it be
> in assays?
>
>>From an experimentation perspective, I would want to redo these
> computations above with an environment holding the assay data and see how
> it changes.  I suppose one can infer from first principles that
>
>> ss4 at assays[[1]]$meth[1:4,1:4]
>       [,1] [,2] [,3] [,4]
> [1,]  0.2  0.2  0.2  0.2
> [2,]  0.2  0.2  0.2  0.2
> [3,]  0.2  0.2  0.2  0.2
> [4,]  0.2  0.2  0.2  0.2
>> ss4 at assays[[2]]$unmeth[1:4,1:4]
>       [,1] [,2] [,3] [,4]
> [1,]  0.2  0.2  0.2  0.2
> [2,]  0.2  0.2  0.2  0.2
> [3,]  0.2  0.2  0.2  0.2
> [4,]  0.2  0.2  0.2  0.2
>> unix.time(colData(ss4)$tx2 <- 1:1000)
>     user  system elapsed
>    0.010   0.002   0.012
>
> where ss4 has a SimpleList of environments holding the assay data.
>
>> ss4
> class: SummarizedExperiment
> dim: 450000 1000
> exptData(0):
> assays(2): '' ''
> rownames: NULL
> rowData values names(0):
> colnames: NULL
> colData names(4): sampleID treatment tx tx2
>> ss4 at assays[[1]]
> <environment: 0x1f4e0ba0>
>
> This can't be attempted without disabling validity checking, and the stack
> trace from the attempt is instructive.
>
> I don't think it will be too difficult to establish a customized
> SummarizedExperiment that permits this, using the lockedEnvironment
> approach of ExpressionSets.  I also do not know the down sides but it seems
> you/we could get some mileage on the approach and see.

GenomicRanges v. 1.9.59 has some preliminary changes that address the 
performance issue. Serialized (saved) objects 'x' need to be updated 
with updateObject(x) (and be careful if these are your only instances of 
some hard work). There are some loose ends to tidy up.

Martin

>
>
> On Thu, Aug 30, 2012 at 10:53 AM, Kasper Daniel Hansen <
> kasperdanielhansen at gmail.com> wrote:
>
>> My perspective (and the reason for me email) is pretty simple.
>>
>> First off, let me say that we will always be working on trading off
>> memory and IO and development time and all the other stuff.  I mean,
>> clearly the environment stuff for ExpressionSets did not help all
>> people, because we have xps, aroma.affymetrix and oligo all using some
>> kind of file based backend.
>>
>> I am not trying to solve something "big".  Putting methylation data
>> inside an environment is not suddenly going to make my algorithms go
>> _significantly_ faster or the memory usage to go _significantly_ down.
>>   But it makes life faster when you explore data - I believe.
>>
>> Now for two realistic examples (one minfi, one bsseq) that I encounter
>> all the time:
>> library(GenomicRanges)
>> nRows <- 450000
>> nCols <- 1000
>> SSet <- SummarizedExperiment(rowData = GRanges(seqnames = rep("chr1",
>> nRows),
>>                               ranges = IRanges(1:nRows, width = 1)),
>>                               colData = DataFrame(sampleID =
>> paste0("person", 1:nCols)),
>>                               exptData = SimpleList(meth = matrix(0.2,
>> ncol = nCols, nrow = nRows),
>>                               unmeth = matrix(0.2, ncol = nCols, nrow =
>> nRows)))
>> print(object.size(SSet), units = "auto")  # 6.7GB
>> system.time({
>>      colData(SSet)$treatment = rep(c("a", "b"), nCols / 2)
>> }) # around 26 sec
>>
>>
>> nRows <- 28000000
>> nCols <- 10
>> SSet <- SummarizedExperiment(rowData = GRanges(seqnames = rep("chr1",
>> nRows),
>>                               ranges = IRanges(1:nRows, width = 1)),
>>                               colData = DataFrame(sampleID =
>> paste0("person", 1:nCols)),
>>                               exptData = SimpleList(meth = matrix(0.2,
>> ncol = nCols, nrow = nRows),
>>                               unmeth = matrix(0.2, ncol = nCols, nrow =
>> nRows)))
>> print(object.size(SSet), units = "auto")  # 4.4GB
>> system.time({
>>      colData(SSet)$treatment = rep(c("a", "b"), nCols / 2)
>> }) # around 20 sec
>>
>> This is of course not that slow, especially not when compared to the
>> massive amount of computation I had to do in order to arrive at the
>> SSet in the first place.  But the difference is that I staring at the
>> screen when I work interactively, but I am out partying when I run my
>> computations (overnight).  And I do stuff like the above surprisingly
>> many times after I have generated the objects.
>>
>> My perspective is that this was essentially "solved" with the
>> ExpressionSet class, and I feel we are regressing a bit.  I also
>> believe that it should be "easy" to re-use some of the ideas/code from
>> ExpressionSet and that it is worthwhile to do so, for something that
>> may very well be an extremely important core class.
>> SummarizedExperiment is such a nice class that I essentially have a
>> parallel implementation in bsseq and Pete has one in genoset.
>>
>> So essentially, I was interested in fixing a pretty small problem, but
>> something that I am a bit irritated about.  And I feel we almost have
>> a ready to use solution.
>>
>> The reason I ask about it here, and not just do it myself, is twofold
>> (1) it is tricky to get the environment right, and I do not have
>> enough experience on the environment part of ExpressionSet, that I
>> feel comfortable doing it
>> (2) I really think it would be beneficial to the community, and
>> experience tells us, that what happens in these core infrastructure
>> packages are much more likely to be widely used.
>>
>> While I think that it may be worthwhile for people to play with
>> reference classes, I think that going down that route is unexplored
>> and may or may not be worth it.  There is something to be said for
>> building on something we have experience with.  A reference class idea
>> can be developed in parallel, if people are willing to commit the
>> time.
>>
>> Kasper
>>
>>
>> On Thu, Aug 30, 2012 at 10:12 AM, Tim Triche, Jr. <tim.triche at gmail.com>
>> wrote:
>>>   I am interested in how best to benchmark such operations (and another
>>> thought that I recently had, "why can't I put a Doug Bates style 'Matrix'
>>> in the SimpleList of assay data)?  Kasper is the person who convinced me
>>> that ff, rhdf5, etc.should be a last resort, since RAM is cheap.
>>>
>>> For large data, the TCGA BRCA samples are a hair under 800 (last I
>> looked)
>>> 450k arrays with matching RNAseq and DNAseq, and that's not even a big
>>> study by breast cancer standards (IIRC, Gordon Mills was rounding up 5000
>>> tumors) Both minfi and methylumi can work off of the raw data in these
>>> studies as a "practical" example, but when it really gets interesting is
>>> when 3-4 ragged assays exist for each patient.
>>>
>>> Exon-wise (or splice-graph-wise) RNAseq data with splice junction read
>>> counts is another place to find big SE-centric data with ambiguities.
>> One
>>> of the reasons I keep coming back to the idea of subsetting an SE by a
>>> GRanges[List], e.g. SE.totalRNA[ UCSC.lincRNAs, ], is its speed. In any
>>> event, after my obligations for the day are dispatched, I'll write up an
>>> example of doing this off of TCGA data, perhaps from colorectal (only
>> about
>>> 400 subjects but at least it is published already).  You are right, a
>>> single use case is often worth 1000 specifications.
>>>
>>> --t
>>>
>>>
>>>
>>> On Thu, Aug 30, 2012 at 6:42 AM, Vincent Carey
>>> <stvjc at channing.harvard.edu>wrote:
>>>
>>>>
>>>>
>>>> On Thu, Aug 30, 2012 at 9:16 AM, Sean Davis <sdavis2 at mail.nih.gov>
>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Aug 30, 2012 at 8:59 AM, Martin Morgan <mtmorgan at fhcrc.org
>>> wrote:
>>>>>
>>>>>> On 08/30/2012 04:42 AM, Vincent Carey wrote:
>>>>>>
>>>>>>> On Thu, Aug 30, 2012 at 6:27 AM, Tim Triche, Jr. <
>> tim.triche at gmail.com
>>>>>>>> wrote:
>>>>>>>
>>>>>>>   nb.  one of the reasons for the existence of the MergedDataSet
>> class in
>>>>>>>> regulatoR (to be submitted for review shortly) is that, while SEs
>> are
>>>>>>>> absolutely fantastic for managing data matrices that are stapled to
>> a
>>>>>>>> GRanges, what is less awesome is having a relatively light-weight
>>>>>>>> DataFrame
>>>>>>>> for phenotypic data that requires the entire memory footprint be
>>>>>>>> recreated
>>>>>>>> upon writing a new column into said DataFrame.
>>>>>>>>
>>>>>>>> If R5 classes didn't spook me a little, I would already have done
>>>>>>>> something
>>>>>>>>
>>>>>>>>
>>>>>>> We don't often use this R5 terminology but I see Hadley has made an
>>>>>>> accessible document referring to reference classes in this way.
>>>>>>>
>>>>>>
>>>>>> To me the challenge is more conceptual -- pass-by-reference and the
>> way
>>>>>> that two variables pointing to the instance are updated at the same
>> time --
>>>>>> and I had been thinking of a LockedEnvironment-style implementation
>> where
>>>>>> some operations were free ('copying') but others weren't (subset,
>> subset
>>>>>> assign). But maybe there are some more direct approaches...
>>>>>>
>>>>>>
>>>>>>   My 2c: This is a situation where some experimental data would be
>>>>>>> helpful.
>>>>>>>
>>>>>>
>>>>>>
>>>>> Perhaps Tim or Kasper can share a largish (unpublished) dataset and a
>>>>> typical workflow with the Seattle folks.  Even that level of detail
>> would
>>>>> give some sense of the scope of the problem.
>>>>>
>>>>
>>>>
>>>> For clarification, I did not mean large data, although that would be
>>>> welcome.  I meant data on computational experiments with different
>>>> approaches -- that is, performance statistics that would indicate where
>>>> scalability fails, what steps have been taken to recover it, and how
>> costly
>>>> those steps are in terms of complexity, reliability, maintainability,
>> risk
>>>> at the user/application end.
>>>>
>>>>
>>>>
>>>>> Sean
>>>>>
>>>>>
>>>>>> Yes, for instance where in the interactive use is time being spent? Is
>>>>>> it copying the assays, or validity, or actually updating the row
>> data? Is
>>>>>> 500000 x 800 an appropriate scale to be thinking about?
>>>>>>
>>>>>>
>>>>>>     The main avenues for a developer seem to be a) use environments or
>>>>>>> reference classes; there are some costs and we should understand
>> them,
>>>>>>> and
>>>>>>> b) use an out-of-memory approach like rhdf5 or ff.  Again there will
>> be
>>>>>>> some costs.  It should be relatively easy to experiment with these.
>>   One
>>>>>>> thing I just learned about is setValidity2 and disableValidity
>> (defined
>>>>>>> in
>>>>>>> IRanges IIRC) ... these allow you to construct certain variations on
>>>>>>> SummarizedExperiment with less attention to deeper infrastructure.
>>>>>>>
>>>>>>
>>>>>> probably I can make better use of the insights the IRanges guys have
>> had
>>>>>> in their careful development and application of validity methods,
>> though I
>>>>>> feel a bit like these are 'attractive hazards' that tempt us to do
>> unsafe
>>>>>> things and then pay the price later. This is likely the first
>> direction
>>>>>> I'll explore.
>>>>>>
>>>>>> Exploring a little I already see that there are some pretty dumb
>> things
>>>>>> being done in assignment.
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>
>>>>>>   whereby the assays/libraries for a given study subject are all
>> pointed
>>>>>>>> to
>>>>>>>> as SEs (i.e. RNAseq, BSseq, expression/methylation arrays, CNV/SNP
>>>>>>>> arrays,
>>>>>>>> WGS or exomic DNAseq) and the column (phenotype) data can avoid
>> being
>>>>>>>> subject to these constraints.  Truth be told I *still* want to do
>> that
>>>>>>>> because, most of the time, updates to the latter are independent of,
>>>>>>>> and
>>>>>>>> happen subsequently to, loading the former.
>>>>>>>>
>>>>>>>> Suggestions would be welcome, because other than these minor
>> niggles,
>>>>>>>> the
>>>>>>>> SummarizedExperiment class is almost perfect for many tasks.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 29, 2012 at 9:57 PM, Tim Triche, Jr. <
>> tim.triche at gmail.com
>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>
>>>>>>>>   assigning new colData columns, or overwriting old ones, in a
>> sizable
>>>>>>>>> (say
>>>>>>>>> 500000 row x 800 column) SE is nauseatingly slow.
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>>>> There has to be a better way -- I'm willing to write it if someone
>> can
>>>>>>>>> point out an obvious way to do it
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Aug 29, 2012 at 9:52 PM, Kasper Daniel Hansen <
>>>>>>>>> kasperdanielhansen at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>   On Thu, Aug 30, 2012 at 12:44 AM, Martin Morgan <
>> mtmorgan at fhcrc.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> On 08/29/2012 06:46 PM, Kasper Daniel Hansen wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> There is a lot of good stuff to say about SummarizedExperiments,
>>>>>>>>>>>> and
>>>>>>>>>>>> from a certain point of view I have a parallel implementation in
>>>>>>>>>>>>
>>>>>>>>>>> bsseq
>>>>>>>>
>>>>>>>>>   (and there is also one in genoset).
>>>>>>>>>>>>
>>>>>>>>>>>> However, I really like having the assayData inside an
>> environment.
>>>>>>>>>>>> This helps some on memory and - equally important - speed at the
>>>>>>>>>>>> command line.  I certainly need to very heavily consider using
>> an
>>>>>>>>>>>> environment in bsseq.
>>>>>>>>>>>>
>>>>>>>>>>>> After some discussion with Tim (Triche) we have agreed that
>>>>>>>>>>>> something
>>>>>>>>>>>> like SummarizedExperiments is the way to go at least for the
>>>>>>>>>>>> methylation arrays.  We need to be able to easily handle 1000s
>> of
>>>>>>>>>>>> samples.
>>>>>>>>>>>>
>>>>>>>>>>>> What is the chance that we can get the option of having the
>>>>>>>>>>>> assayData
>>>>>>>>>>>> inside an environment, perhaps by
>>>>>>>>>>>>      Making a class that is an environment and inherits from
>>>>>>>>>>>>
>>>>>>>>>>> SimpleList.
>>>>>>>>
>>>>>>>>>       Using a classUnion between the existing class of the assayData
>>>>>>>>>>>> and
>>>>>>>>>>>> an environment.
>>>>>>>>>>>>      Third option that is probably better than the proceeding
>> two,
>>>>>>>>>>>> but
>>>>>>>>>>>> which I cannot come up with right now.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Probably something can / will be done. I guess the slowness
>> you're
>>>>>>>>>>>
>>>>>>>>>> talking
>>>>>>>>>>
>>>>>>>>>>> about is when rowData / colData columns are manipulated; any
>> kind of
>>>>>>>>>>> subsetting would mean a 'deep' copy. Martin
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, for example manipulating colData - something that
>> conceptually
>>>>>>>>>> should be quick and easy.  Of course, this will not affect any
>> real
>>>>>>>>>> computation on the assayData matrices, but it will make life at
>> the
>>>>>>>>>> command prompt more pleasant.
>>>>>>>>>>
>>>>>>>>>> Kasper
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> This would - in my opinion - be very nice and worthwhile.
>>>>>>>>>>>>
>>>>>>>>>>>> Kasper
>>>>>>>>>>>>
>>>>>>>>>>>> ______________________________**_________________
>>>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>>>>>>> 1100 Fairview Ave. N.
>>>>>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>>>>>
>>>>>>>>>>> Location: Arnold Building M1 B861
>>>>>>>>>>> Phone: (206) 667-2793
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ______________________________**_________________
>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *A model is a lie that helps you see the truth.*
>>>>>>>>> *
>>>>>>>>> *
>>>>>>>>> Howard Skipper<
>>>>>>>>>
>>>>>>>> http://cancerres.aacrjournals.**org/content/31/9/1173.full.pdf<
>> http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>>>>>>> **>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> *A model is a lie that helps you see the truth.*
>>>>>>>> *
>>>>>>>> *
>>>>>>>> Howard Skipper<
>>>>>>>> http://cancerres.aacrjournals.**org/content/31/9/1173.full.pdf<
>> http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>>>>>>> **>
>>>>>>>>
>>>>>>>>           [[alternative HTML version deleted]]
>>>>>>>>
>>>>>>>> ______________________________**_________________
>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>>>
>>>>>>>>
>>>>>>>          [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> ______________________________**_________________
>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N.
>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>
>>>>>> Location: Arnold Building M1 B861
>>>>>> Phone: (206) 667-2793
>>>>>>
>>>>>> ______________________________**_________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> *A model is a lie that helps you see the truth.*
>>> *
>>> *
>>> Howard Skipper<
>> http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-devel mailing list