[Bioc-devel] SummarizedExperiments

Thu Aug 30 16:53:56 CEST 2012

My perspective (and the reason for me email) is pretty simple.

First off, let me say that we will always be working on trading off
memory and IO and development time and all the other stuff.  I mean,
clearly the environment stuff for ExpressionSets did not help all
people, because we have xps, aroma.affymetrix and oligo all using some
kind of file based backend.

I am not trying to solve something "big".  Putting methylation data
inside an environment is not suddenly going to make my algorithms go
_significantly_ faster or the memory usage to go _significantly_ down.
 But it makes life faster when you explore data - I believe.

Now for two realistic examples (one minfi, one bsseq) that I encounter
all the time:
library(GenomicRanges)
nRows <- 450000
nCols <- 1000
SSet <- SummarizedExperiment(rowData = GRanges(seqnames = rep("chr1",
nRows),
                             ranges = IRanges(1:nRows, width = 1)),
                             colData = DataFrame(sampleID =
paste0("person", 1:nCols)),
                             exptData = SimpleList(meth = matrix(0.2,
ncol = nCols, nrow = nRows),
                             unmeth = matrix(0.2, ncol = nCols, nrow =
nRows)))
print(object.size(SSet), units = "auto")  # 6.7GB
system.time({
    colData(SSet)$treatment = rep(c("a", "b"), nCols / 2)
}) # around 26 sec

nRows <- 28000000
nCols <- 10
SSet <- SummarizedExperiment(rowData = GRanges(seqnames = rep("chr1",
nRows),
                             ranges = IRanges(1:nRows, width = 1)),
                             colData = DataFrame(sampleID =
paste0("person", 1:nCols)),
                             exptData = SimpleList(meth = matrix(0.2,
ncol = nCols, nrow = nRows),
                             unmeth = matrix(0.2, ncol = nCols, nrow =
nRows)))
print(object.size(SSet), units = "auto")  # 4.4GB
system.time({
    colData(SSet)$treatment = rep(c("a", "b"), nCols / 2)
}) # around 20 sec

This is of course not that slow, especially not when compared to the
massive amount of computation I had to do in order to arrive at the
SSet in the first place.  But the difference is that I staring at the
screen when I work interactively, but I am out partying when I run my
computations (overnight).  And I do stuff like the above surprisingly
many times after I have generated the objects.

My perspective is that this was essentially "solved" with the
ExpressionSet class, and I feel we are regressing a bit.  I also
believe that it should be "easy" to re-use some of the ideas/code from
ExpressionSet and that it is worthwhile to do so, for something that
may very well be an extremely important core class.
SummarizedExperiment is such a nice class that I essentially have a
parallel implementation in bsseq and Pete has one in genoset.

So essentially, I was interested in fixing a pretty small problem, but
something that I am a bit irritated about.  And I feel we almost have
a ready to use solution.

The reason I ask about it here, and not just do it myself, is twofold
(1) it is tricky to get the environment right, and I do not have
enough experience on the environment part of ExpressionSet, that I
feel comfortable doing it
(2) I really think it would be beneficial to the community, and
experience tells us, that what happens in these core infrastructure
packages are much more likely to be widely used.

While I think that it may be worthwhile for people to play with
reference classes, I think that going down that route is unexplored
and may or may not be worth it.  There is something to be said for
building on something we have experience with.  A reference class idea
can be developed in parallel, if people are willing to commit the
time.

Kasper

On Thu, Aug 30, 2012 at 10:12 AM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
>  I am interested in how best to benchmark such operations (and another
> thought that I recently had, "why can't I put a Doug Bates style 'Matrix'
> in the SimpleList of assay data)?  Kasper is the person who convinced me
> that ff, rhdf5, etc.should be a last resort, since RAM is cheap.
>
> For large data, the TCGA BRCA samples are a hair under 800 (last I looked)
> 450k arrays with matching RNAseq and DNAseq, and that's not even a big
> study by breast cancer standards (IIRC, Gordon Mills was rounding up 5000
> tumors) Both minfi and methylumi can work off of the raw data in these
> studies as a "practical" example, but when it really gets interesting is
> when 3-4 ragged assays exist for each patient.
>
> Exon-wise (or splice-graph-wise) RNAseq data with splice junction read
> counts is another place to find big SE-centric data with ambiguities.   One
> of the reasons I keep coming back to the idea of subsetting an SE by a
> GRanges[List], e.g. SE.totalRNA[ UCSC.lincRNAs, ], is its speed. In any
> event, after my obligations for the day are dispatched, I'll write up an
> example of doing this off of TCGA data, perhaps from colorectal (only about
> 400 subjects but at least it is published already).  You are right, a
> single use case is often worth 1000 specifications.
>
> --t
>
>
>
> On Thu, Aug 30, 2012 at 6:42 AM, Vincent Carey
> <stvjc at channing.harvard.edu>wrote:
>
>>
>>
>> On Thu, Aug 30, 2012 at 9:16 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>>
>>>
>>>
>>> On Thu, Aug 30, 2012 at 8:59 AM, Martin Morgan <mtmorgan at fhcrc.org>wrote:
>>>
>>>> On 08/30/2012 04:42 AM, Vincent Carey wrote:
>>>>
>>>>> On Thu, Aug 30, 2012 at 6:27 AM, Tim Triche, Jr. <tim.triche at gmail.com
>>>>> >wrote:
>>>>>
>>>>>  nb.  one of the reasons for the existence of the MergedDataSet class in
>>>>>> regulatoR (to be submitted for review shortly) is that, while SEs are
>>>>>> absolutely fantastic for managing data matrices that are stapled to a
>>>>>> GRanges, what is less awesome is having a relatively light-weight
>>>>>> DataFrame
>>>>>> for phenotypic data that requires the entire memory footprint be
>>>>>> recreated
>>>>>> upon writing a new column into said DataFrame.
>>>>>>
>>>>>> If R5 classes didn't spook me a little, I would already have done
>>>>>> something
>>>>>>
>>>>>>
>>>>> We don't often use this R5 terminology but I see Hadley has made an
>>>>> accessible document referring to reference classes in this way.
>>>>>
>>>>
>>>> To me the challenge is more conceptual -- pass-by-reference and the way
>>>> that two variables pointing to the instance are updated at the same time --
>>>> and I had been thinking of a LockedEnvironment-style implementation where
>>>> some operations were free ('copying') but others weren't (subset, subset
>>>> assign). But maybe there are some more direct approaches...
>>>>
>>>>
>>>>  My 2c: This is a situation where some experimental data would be
>>>>> helpful.
>>>>>
>>>>
>>>>
>>> Perhaps Tim or Kasper can share a largish (unpublished) dataset and a
>>> typical workflow with the Seattle folks.  Even that level of detail would
>>> give some sense of the scope of the problem.
>>>
>>
>>
>> For clarification, I did not mean large data, although that would be
>> welcome.  I meant data on computational experiments with different
>> approaches -- that is, performance statistics that would indicate where
>> scalability fails, what steps have been taken to recover it, and how costly
>> those steps are in terms of complexity, reliability, maintainability, risk
>> at the user/application end.
>>
>>
>>
>>> Sean
>>>
>>>
>>>> Yes, for instance where in the interactive use is time being spent? Is
>>>> it copying the assays, or validity, or actually updating the row data? Is
>>>> 500000 x 800 an appropriate scale to be thinking about?
>>>>
>>>>
>>>>    The main avenues for a developer seem to be a) use environments or
>>>>> reference classes; there are some costs and we should understand them,
>>>>> and
>>>>> b) use an out-of-memory approach like rhdf5 or ff.  Again there will be
>>>>> some costs.  It should be relatively easy to experiment with these.  One
>>>>> thing I just learned about is setValidity2 and disableValidity (defined
>>>>> in
>>>>> IRanges IIRC) ... these allow you to construct certain variations on
>>>>> SummarizedExperiment with less attention to deeper infrastructure.
>>>>>
>>>>
>>>> probably I can make better use of the insights the IRanges guys have had
>>>> in their careful development and application of validity methods, though I
>>>> feel a bit like these are 'attractive hazards' that tempt us to do unsafe
>>>> things and then pay the price later. This is likely the first direction
>>>> I'll explore.
>>>>
>>>> Exploring a little I already see that there are some pretty dumb things
>>>> being done in assignment.
>>>>
>>>> Martin
>>>>
>>>>
>>>>  whereby the assays/libraries for a given study subject are all pointed
>>>>>> to
>>>>>> as SEs (i.e. RNAseq, BSseq, expression/methylation arrays, CNV/SNP
>>>>>> arrays,
>>>>>> WGS or exomic DNAseq) and the column (phenotype) data can avoid being
>>>>>> subject to these constraints.  Truth be told I *still* want to do that
>>>>>> because, most of the time, updates to the latter are independent of,
>>>>>> and
>>>>>> happen subsequently to, loading the former.
>>>>>>
>>>>>> Suggestions would be welcome, because other than these minor niggles,
>>>>>> the
>>>>>> SummarizedExperiment class is almost perfect for many tasks.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 29, 2012 at 9:57 PM, Tim Triche, Jr. <tim.triche at gmail.com
>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>
>>>>>>  assigning new colData columns, or overwriting old ones, in a sizable
>>>>>>> (say
>>>>>>> 500000 row x 800 column) SE is nauseatingly slow.
>>>>>>>
>>>>>>
>>>>
>>>>
>>>>>>> There has to be a better way -- I'm willing to write it if someone can
>>>>>>> point out an obvious way to do it
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 29, 2012 at 9:52 PM, Kasper Daniel Hansen <
>>>>>>> kasperdanielhansen at gmail.com> wrote:
>>>>>>>
>>>>>>>  On Thu, Aug 30, 2012 at 12:44 AM, Martin Morgan <mtmorgan at fhcrc.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On 08/29/2012 06:46 PM, Kasper Daniel Hansen wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> There is a lot of good stuff to say about SummarizedExperiments,
>>>>>>>>>> and
>>>>>>>>>> from a certain point of view I have a parallel implementation in
>>>>>>>>>>
>>>>>>>>> bsseq
>>>>>>
>>>>>>>  (and there is also one in genoset).
>>>>>>>>>>
>>>>>>>>>> However, I really like having the assayData inside an environment.
>>>>>>>>>> This helps some on memory and - equally important - speed at the
>>>>>>>>>> command line.  I certainly need to very heavily consider using an
>>>>>>>>>> environment in bsseq.
>>>>>>>>>>
>>>>>>>>>> After some discussion with Tim (Triche) we have agreed that
>>>>>>>>>> something
>>>>>>>>>> like SummarizedExperiments is the way to go at least for the
>>>>>>>>>> methylation arrays.  We need to be able to easily handle 1000s of
>>>>>>>>>> samples.
>>>>>>>>>>
>>>>>>>>>> What is the chance that we can get the option of having the
>>>>>>>>>> assayData
>>>>>>>>>> inside an environment, perhaps by
>>>>>>>>>>     Making a class that is an environment and inherits from
>>>>>>>>>>
>>>>>>>>> SimpleList.
>>>>>>
>>>>>>>      Using a classUnion between the existing class of the assayData
>>>>>>>>>> and
>>>>>>>>>> an environment.
>>>>>>>>>>     Third option that is probably better than the proceeding two,
>>>>>>>>>> but
>>>>>>>>>> which I cannot come up with right now.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Probably something can / will be done. I guess the slowness you're
>>>>>>>>>
>>>>>>>> talking
>>>>>>>>
>>>>>>>>> about is when rowData / colData columns are manipulated; any kind of
>>>>>>>>> subsetting would mean a 'deep' copy. Martin
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, for example manipulating colData - something that conceptually
>>>>>>>> should be quick and easy.  Of course, this will not affect any real
>>>>>>>> computation on the assayData matrices, but it will make life at the
>>>>>>>> command prompt more pleasant.
>>>>>>>>
>>>>>>>> Kasper
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> This would - in my opinion - be very nice and worthwhile.
>>>>>>>>>>
>>>>>>>>>> Kasper
>>>>>>>>>>
>>>>>>>>>> ______________________________**_________________
>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>>>>> 1100 Fairview Ave. N.
>>>>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>>>>
>>>>>>>>> Location: Arnold Building M1 B861
>>>>>>>>> Phone: (206) 667-2793
>>>>>>>>>
>>>>>>>>
>>>>>>>> ______________________________**_________________
>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *A model is a lie that helps you see the truth.*
>>>>>>> *
>>>>>>> *
>>>>>>> Howard Skipper<
>>>>>>>
>>>>>> http://cancerres.aacrjournals.**org/content/31/9/1173.full.pdf<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>>>>> **>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> *A model is a lie that helps you see the truth.*
>>>>>> *
>>>>>> *
>>>>>> Howard Skipper<
>>>>>> http://cancerres.aacrjournals.**org/content/31/9/1173.full.pdf<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>>>>> **>
>>>>>>
>>>>>>          [[alternative HTML version deleted]]
>>>>>>
>>>>>> ______________________________**_________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>
>>>>>>
>>>>>         [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________**_________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>>
>>>> ______________________________**_________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>
>>>
>>>
>>
>
>
> --
> *A model is a lie that helps you see the truth.*
> *
> *
> Howard Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel