[Bioc-devel] SummarizedExperiments

Martin Morgan mtmorgan at fhcrc.org
Thu Aug 30 14:59:12 CEST 2012


On 08/30/2012 04:42 AM, Vincent Carey wrote:
> On Thu, Aug 30, 2012 at 6:27 AM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
>
>> nb.  one of the reasons for the existence of the MergedDataSet class in
>> regulatoR (to be submitted for review shortly) is that, while SEs are
>> absolutely fantastic for managing data matrices that are stapled to a
>> GRanges, what is less awesome is having a relatively light-weight DataFrame
>> for phenotypic data that requires the entire memory footprint be recreated
>> upon writing a new column into said DataFrame.
>>
>> If R5 classes didn't spook me a little, I would already have done something
>>
>
> We don't often use this R5 terminology but I see Hadley has made an
> accessible document referring to reference classes in this way.

To me the challenge is more conceptual -- pass-by-reference and the way 
that two variables pointing to the same instance are updated at the same 
time -- and I had been thinking of a LockedEnvironment-style implementation 
where some operations were free ('copying') but others weren't (subset, 
subset assign). But maybe there are some more direct approaches...
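
To make the pass-by-reference concern concrete, here is a minimal base-R 
sketch. It is not SummarizedExperiment code: the variable names are invented 
for illustration, and it assumes that lockEnvironment() is a fair stand-in 
for a LockedEnvironment-style design.

```r
# Base R only: contrast copy-on-modify values with reference-like environments.

# Ordinary R values behave as if they were copied on assignment.
a <- list(score = 1)
b <- a
b$score <- 2                      # modifies b only; a is untouched

# Environments are not copied: e1 and e2 name the same object.
e1 <- new.env()
assign("score", 1, envir = e1)
e2 <- e1
assign("score", 2, envir = e2)    # the change is visible through e1 as well

# Locking keeps the cheap 'copy' (e2 <- e1) but refuses in-place changes,
# so any update has to build a fresh environment explicitly.
lockEnvironment(e1, bindings = TRUE)
locked_update <- tryCatch({
    assign("score", 3, envir = e1)
    "modified"
}, error = function(err) "refused")

a_score <- a$score
e1_score <- get("score", envir = e1)
```

The trade-off is the one described above: copying is free because nothing 
is duplicated, while subset and subset-assign would have to create and fill 
a new environment.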

> My 2c: This is a situation where some experimental data would be helpful.

Yes, for instance, where is the time being spent in interactive use? Is 
it in copying the assays, in checking validity, or in actually updating 
the row data? Is 500000 x 800 an appropriate scale to be thinking about?
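
One rough way to collect that kind of data is to time each candidate cost 
separately. The sketch below uses a made-up S4 class, ToyExperiment, as a 
stand-in for SummarizedExperiment, with dimensions scaled down from 
500000 x 800 so it runs quickly. The class, its slots, and the numbers are 
all assumptions for illustration, and timings on the real class may look 
quite different.

```r
library(methods)

# Hypothetical stand-in for a SummarizedExperiment: one assay matrix plus
# per-column annotations.
setClass("ToyExperiment",
    representation(assay = "matrix", colData = "data.frame"),
    validity = function(object) {
        if (ncol(object@assay) != nrow(object@colData))
            "ncol(assay) must equal nrow(colData)"
        else TRUE
    })

n_row <- 5000L
n_col <- 80L
toy <- new("ToyExperiment",
    assay = matrix(0, n_row, n_col),
    colData = data.frame(id = seq_len(n_col)))

# Time each candidate cost on its own.
t_copy <- system.time(assay_copy <- toy@assay + 0)[["elapsed"]]  # allocates a new matrix
t_valid <- system.time(validObject(toy))[["elapsed"]]            # runs the validity method
t_update <- system.time({
    cd <- toy@colData
    cd$group <- rep(c("a", "b"), length.out = n_col)
    toy@colData <- cd                                            # replaces the colData slot
})[["elapsed"]]

timings <- c(copy = t_copy, validity = t_valid, update = t_update)
print(timings)
```

For the real class, Rprof() around a single colData assignment would give a 
finer breakdown than these three coarse timings.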

>   The main avenues for a developer seem to be a) use environments or
> reference classes; there are some costs and we should understand them, and
> b) use an out-of-memory approach like rhdf5 or ff.  Again there will be
> some costs.  It should be relatively easy to experiment with these.  One
> thing I just learned about is setValidity2 and disableValidity (defined in
> IRanges IIRC) ... these allow you to construct certain variations on
> SummarizedExperiment with less attention to deeper infrastructure.

Probably I can make better use of the insights the IRanges guys have had 
in their careful development and application of validity methods, though 
I feel a bit like these are 'attractive hazards' that tempt us to do 
unsafe things and then pay the price later. This is likely the first 
direction I'll explore.
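
As an illustration of the hazard, the sketch below uses only base R and the 
methods package. It does not use setValidity2 or disableValidity; the class 
CheckedTable and the counter check_count are invented for the example. It 
contrasts an update that re-runs validity with one that skips it.

```r
library(methods)

# Hypothetical class whose validity method counts how often it runs.
check_count <- 0L
setClass("CheckedTable",
    representation(values = "numeric", labels = "character"),
    validity = function(object) {
        check_count <<- check_count + 1L
        if (length(object@values) != length(object@labels))
            "values and labels must have the same length"
        else TRUE
    })

x <- new("CheckedTable", values = c(1, 2), labels = c("a", "b"))
after_new <- check_count          # new() with slot arguments validates

# Safe update: initialize() re-validates the modified object.
y <- initialize(x, labels = c("c", "d"))
after_initialize <- check_count

# Fast update: direct slot assignment checks only the slot's class,
# so an inconsistent object slips through unnoticed.
z <- x
z@labels <- "only one"
after_slot <- check_count

# The inconsistency surfaces only when validity is requested explicitly.
caught <- tryCatch({ validObject(z); FALSE }, error = function(e) TRUE)
```

Skipping the check makes the update cheaper, and the price is paid later, 
when some downstream code trips over the invalid object.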

Exploring a little, I already see that there are some pretty dumb things 
being done in assignment.

Martin

>> whereby the assays/libraries for a given study subject are all pointed to
>> as SEs (i.e. RNAseq, BSseq, expression/methylation arrays, CNV/SNP arrays,
>> WGS or exomic DNAseq) and the column (phenotype) data can avoid being
>> subject to these constraints.  Truth be told I *still* want to do that
>> because, most of the time, updates to the latter are independent of, and
>> happen subsequently to, loading the former.
>>
>> Suggestions would be welcome, because other than these minor niggles, the
>> SummarizedExperiment class is almost perfect for many tasks.
>>
>>
>>
>>
>> On Wed, Aug 29, 2012 at 9:57 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
>>
>>> assigning new colData columns, or overwriting old ones, in a sizable (say
>>> 500000 row x 800 column) SE is nauseatingly slow.
>>>
>>> There has to be a better way -- I'm willing to write it if someone can
>>> point out an obvious way to do it
>>>
>>>
>>>
>>> On Wed, Aug 29, 2012 at 9:52 PM, Kasper Daniel Hansen <
>>> kasperdanielhansen at gmail.com> wrote:
>>>
>>>> On Thu, Aug 30, 2012 at 12:44 AM, Martin Morgan <mtmorgan at fhcrc.org>
>>>> wrote:
>>>>> On 08/29/2012 06:46 PM, Kasper Daniel Hansen wrote:
>>>>>>
>>>>>> There is a lot of good stuff to say about SummarizedExperiments, and
>>>>>> from a certain point of view I have a parallel implementation in
>>>>>> bsseq (and there is also one in genoset).
>>>>>>
>>>>>> However, I really like having the assayData inside an environment.
>>>>>> This helps some on memory and - equally important - speed at the
>>>>>> command line.  I certainly need to very heavily consider using an
>>>>>> environment in bsseq.
>>>>>>
>>>>>> After some discussion with Tim (Triche) we have agreed that something
>>>>>> like SummarizedExperiments is the way to go at least for the
>>>>>> methylation arrays.  We need to be able to easily handle 1000s of
>>>>>> samples.
>>>>>>
>>>>>> What is the chance that we can get the option of having the assayData
>>>>>> inside an environment, perhaps by
>>>>>>     Making a class that is an environment and inherits from
>>>>>>     SimpleList.
>>>>>>     Using a classUnion between the existing class of the assayData and
>>>>>>     an environment.
>>>>>>     A third option that is probably better than the preceding two, but
>>>>>>     which I cannot come up with right now.
>>>>>
>>>>>
>>>>> Probably something can / will be done. I guess the slowness you're
>>>>> talking about is when rowData / colData columns are manipulated; any
>>>>> kind of subsetting would mean a 'deep' copy. Martin
>>>>
>>>> Yes, for example manipulating colData - something that conceptually
>>>> should be quick and easy.  Of course, this will not affect any real
>>>> computation on the assayData matrices, but it will make life at the
>>>> command prompt more pleasant.
>>>>
>>>> Kasper
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> This would - in my opinion - be very nice and worthwhile.
>>>>>>
>>>>>> Kasper
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N.
>>>>> PO Box 19024 Seattle, WA 98109
>>>>>
>>>>> Location: Arnold Building M1 B861
>>>>> Phone: (206) 667-2793
>>>>
>>>
>>>
>>>
>>> --
>>> *A model is a lie that helps you see the truth.*
>>> Howard Skipper
>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>>
>>>
>>
>>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


