[Bioc-devel] SummarizedExperiments

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Fri Sep 7 00:43:44 CEST 2012


On Thu, Sep 6, 2012 at 5:56 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
> Regarding "legacy" eSet-backwards-compatible methods, I implemented a few in regulatoR, and then I ditched them when I saw the logic in stripping away some of this accumulated cruft.  Perhaps it is best to have them for backwards compatibility, since eSets are everywhere. (my response to an eSet these days is: try to coerce it to an SE, which is another can of worms)
>
> I think all the seqinfo-based methods are implemented (thanks to MM).
>
> The last few are my personal favorite in terms of emphasis.  Suppose I have segmented the genome in some cell type (say CD34+ HSCs) and I want to get the subset of features in an SE that fall into the "enhancer" or "insulator" states.  It's nice to be able to do this in one call, so I wanted to use the bracket syntax SE[GR, ANY]. My solution for subsetByOverlaps, such as it is, was
>
> SE[ i=GR, j=ANY, ... ] =
> SE[unique(queryHits(findOverlaps(rowData(SE), GR, ...))), j]
>
> The latter formulation shows "why".  I really want this in GenomicRanges!
>
> This, to me, is the killer feature of an SE -- the sorts of antics that sold me on SEs in the first place -- and it's the ability to do so that sets them apart from eSets.  Passing rowData to findOverlaps and subsetByOverlaps is good, but I claim that this is better.
>
> Objections from the cognoscenti?

Zero'th, regarding sampleNames etc.  Well, I believe we have so much
documentation and usage for those three methods (sampleNames,
featureNames and pData) that we should keep them.  Especially pData.

Now, to the subsetting

First, I believe we should certainly implement subsetBy/findOverlaps
for SummarizedExperiments.  It is pretty straightforward.

Second, your issue here is not so much about SummarizedExperiments as
it is about GRanges in general.  You could use the exact same argument
for why we should be about to do something like
  gr1[gr2]
and have that be
  subsetByOverlaps(gr1, gr2)

I can see the use of this type of subsetting - like Tim, I do it all
the time.  The pro is that it is short notation and it seems natural.
The con is that subsetByOverlaps takes additional arguments and what
should we do with those (put them in ... of the method?) And there may
be additional issues with '[' which sometimes is a bit tricky.
Perhaps all of this was considered and decided against when
subsetBYOverlaps was introduced.

Personally, I think this may be nice, but I am not unhappy with
subsetByOverlaps.

Kasper

> On Sep 6, 2012, at 2:26 PM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
>
>> Alltogether, this looks great.  I am still implementing/testing, but
>> seems pretty nice.
>>
>> Here is a number of convenience methods I believe should be
>> implemented.  Some of them comes from eSet and some of them comes from
>> my own work in bsseq, but I have used all of them extensively in
>> various packages.
>>
>> sampleNames, featureNames, pData
>> granges, start, end, width, strand, seqnames, seqlengths, seqlevels
>> findOverlaps, subsetByOverlaps
>>
>> 'granges' is a method already defined in GenomicRanges, I have found
>> it very convenient.  It is not clear to me why it is all lowercase
>>
>> What I have done in bsseq for the later two lines of methods, is to
>> have a class 'hasGRanges' that designate any object with a GRanges.
>> Then implement the granges() method and the rest goes from there.  Not
>> sure whether that makes sense in this context, but in general I have
>> found it very convenient for my own work.  Code for all of this is in
>> bsseq/R/hasGRanges.R.  I have two classes in the package inheriting
>> from hasGRanges.  Note that the classes in bsseq have not been updated
>> to build on the new SummarizedExperiment (yet).
>>
>> Kasper
>>
>>
>> On Wed, Sep 5, 2012 at 1:31 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
>>> It seems to work with a Matrix in the assays:
>>>
>>> R> packageVersion('GenomicRanges')
>>> [1] ‘1.9.61’
>>>
>>> R> SE <- SummarizedExperiment(assays=wts, rowData=wts.locs)
>>> Error in function (classes, fdef, mtable)  :
>>>  unable to find an inherited method for function "SummarizedExperiment",
>>> for signature "dgCMatrix"
>>>
>>> R> SE <- SummarizedExperiment(assays=SimpleList(weights=wts),
>>> rowData=wts.locs)
>>> dimnames(.) <- NULL:  translated to
>>> dimnames(.) <- list(NULL,NULL)  <==>  unname(.)
>>>
>>> R> show(SE)
>>> class: SummarizedExperiment
>>> dim: 35855 35855
>>> exptData(0):
>>> assays(1): weights
>>> rownames(35855): feat1 feat2 ... feat35854 feat35855
>>> rowData metadata column names(10): egid signed ... DHS_DGF chromHMM_state
>>> colnames(35855): feat1 feat2 ... feat35854 feat35855
>>> colData names(0):
>>>
>>> R> class(assays(SE, 'weights'))
>>> [1] "SimpleList"
>>> attr(,"package")
>>> [1] "IRanges"
>>>
>>> R> class(assays(SE)$weights)
>>> [1] "dgCMatrix"
>>> attr(,"package")
>>> [1] "Matrix"
>>>
>>>
>>> Very cool!  Albeit a tad disorienting.  I was going to patch this up and
>>> then realized, hey, if someone's going to stuff a Matrix into their assays,
>>> they ought to give it a name and know what's going on under the hood, at
>>> least for now.  It seems to work correctly, too.
>>>
>>> Oh oh, I spoke too soon.  If I split by chromosome, the assay falls out:
>>>
>>> R> head(assays(SE)$weights)
>>> 6 x 35855 sparse Matrix of class "dgCMatrix"
>>>   [[ suppressing 82 column names ‘feat1’, ‘feat2’, ‘feat3’ ... ]]
>>>
>>> feat1 1 . .         .         .          .          . . . . . . . . . . . .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> . . . . . . . . . . . .
>>> feat2 . 1 .         .         .          .          . . . . . . . . . . . .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> . . . . . . . . . . . .
>>> feat3 . . 1.0000000 0.5737488 .          .          . . . . . . . . . . . .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> . . . . . . . . . . . .
>>> feat4 . . 0.5737488 1.0000000 .          .          . . . . . . . . . . . .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> . . . . . . . . . . . .
>>> feat5 . . .         .         1.00000000 0.04300373 . . . . . . . . . . . .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> . . . . . . . . . . . .
>>> feat6 . . .         .         0.04300373 1.00000000 . . . . . . . . . . . .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> . . . . . . . . . . . .
>>>
>>> feat1 . . . . . . . . . . . . . . ......
>>> feat2 . . . . . . . . . . . . . . ......
>>> feat3 . . . . . . . . . . . . . . ......
>>> feat4 . . . . . . . . . . . . . . ......
>>> feat5 . . . . . . . . . . . . . . ......
>>> feat6 . . . . . . . . . . . . . . ......
>>>
>>> .....suppressing columns in show(); maybe adjust 'options(max.print= *)'
>>> ..............................
>>> R> SE.by.chr <- split(SE, as.vector(seqnames(SE)))
>>> R> head(assays(SE.by.chr)$weights)
>>> Error in head(assays(SE.by.chr)$weights) :
>>>  error in evaluating the argument 'x' in selecting a method for function
>>> 'head': Error in function (classes, fdef, mtable)  :
>>>  unable to find an inherited method for function "assays", for signature
>>> "list"
>>>
>>> Ideas?  I would love to do something useful here instead of just fussing.
>>> It's really a great structure.
>>>
>>> I love the use of a reference class to reference the internal pieces of an
>>> object, it's such a clever hack.  Thanks much for adding it.
>>>
>>> --t
>>>
>>>
>>>
>>> On Tue, Sep 4, 2012 at 7:58 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>>>
>>>> On 09/04/2012 07:02 PM, Tim Triche, Jr. wrote:
>>>>>
>>>>> This is really cool.
>>>>>
>>>>> R> head(lsos(), 1)
>>>>>                             Type     Size  Rows Columns
>>>>> LAML.RNAseq SummarizedExperiment 38666528 23529     173
>>>>> R> LAML.RNAseq.updated <- updateObject(LAML.RNAseq)
>>>>> R> head(lsos(), 2)
>>>>>                                     Type     Size  Rows Columns
>>>>> LAML.RNAseq         SummarizedExperiment 38666528 23529     173
>>>>> LAML.RNAseq.updated SummarizedExperiment  6101488 23529     173
>>>>
>>>>
>>>> I think either the sizes are misleading (missing the content of the
>>>> environment that the reference classes contain, probably
>>>> LAML.RNAseq.updated at assays$data has size 38666528 - 6101488) or the data has
>>>> not been updated correctly :\. Nonetheless manipulation of especially
>>>> non-assay but also assay data should be faster.
>>>>
>>>> I added a brief section (1.9.60) to ?SummarizedExperiment describing what
>>>> was done. Also, I tried (w/out actually testing...) to add support for
>>>> Matrix instances.
>>>>
>>>> As a miscellaneous 'tip', when working with large data I find it useful to
>>>> start R with --min-vsize=2048M --min-nsize=20M. These are documented in
>>>> RShowDoc("R-intro") as 'for expert use only' and influence how much memory R
>>>> allocates for vectors ('vsize'; vsize can be very large) and for
>>>> S-expressions ('nsize' 50000000) and help get up to large memory allocations
>>>> without innumerable garbage collections. Of course having big memory is of
>>>> primary importance.
>>>>
>>>> Martin
>>>>
>>>>> R> str(LAML.RNAseq.updated at assays)
>>>>> Reference class 'ShallowSimpleListAssays' [package "GenomicRanges"] with
>>>>> 1 fields
>>>>>  $ data:Formal class 'SimpleList' [package "IRanges"] with 4 slots
>>>>>   .. ..@ listData       :List of 1
>>>>>   .. .. ..$ rpm: num [1:23529, 1:173] 0 0 8.12 10.84 27.73 ...
>>>>>   .. ..@ elementType    : chr "ANY"
>>>>>   .. ..@ elementMetadata: NULL
>>>>>   .. ..@ metadata       : list()
>>>>>  and 11 methods,R>
>>>>> R> packageVersion('GenomicRanges')
>>>>> [1] ‘1.9.59’
>>>>>
>>>>> A slightly bigger one:
>>>>>
>>>>> R> head(lsos(),2)
>>>>>                              Type       Size   Rows Columns
>>>>> LAML         SummarizedExperiment 1950606472 485577     192
>>>>> LAML.updated SummarizedExperiment  458912760 485577     192
>>>>> R> system.time(LAML.updated$foo <- rep('bar', 192))
>>>>>    user  system elapsed
>>>>>   0.132   0.116   0.248
>>>>> R> system.time(LAML$foo <- rep('bar', 192))
>>>>>    user  system elapsed
>>>>>   2.728   2.144   7.519
>>>>>
>>>>>
>>>>> That is a really clever hack you added.  I think I will have to figure
>>>>> out how it works :-)
>>>>>
>>>>> Thanks!!!
>>>>>
>>>>>
>>>>> On Sun, Sep 2, 2012 at 2:47 PM, Martin Morgan <mtmorgan at fhcrc.org
>>>>> <mailto:mtmorgan at fhcrc.org>> wrote:
>>>>>
>>>>>    On 08/30/2012 08:47 AM, Vincent Carey wrote:
>>>>>
>>>>>        I am all in favor of the dialogue and the aim.  I reproduced
>>>>>        your first
>>>>>        timing, and note
>>>>>
>>>>>            unix.time(ss2 at colData$tx <- 1:1000)
>>>>>
>>>>>             user  system elapsed
>>>>>            5.963   4.520  10.483
>>>>>
>>>>>            unix.time(colData(ss2)$tx2 <- 1:1000)
>>>>>
>>>>>             user  system elapsed
>>>>>           17.937  13.074  31.016
>>>>>
>>>>>        BTW did you really mean to put the meth/unmeth in exptData, or
>>>>>        should it be
>>>>>        in assays?
>>>>>
>>>>>             From an experimentation perspective, I would want to redo
>>>>> these
>>>>>
>>>>>        computations above with an environment holding the assay data
>>>>>        and see how
>>>>>        it changes.  I suppose one can infer from first principles that
>>>>>
>>>>>            ss4 at assays[[1]]$meth[1:4,1:4]
>>>>>
>>>>>               [,1] [,2] [,3] [,4]
>>>>>        [1,]  0.2  0.2  0.2  0.2
>>>>>        [2,]  0.2  0.2  0.2  0.2
>>>>>        [3,]  0.2  0.2  0.2  0.2
>>>>>        [4,]  0.2  0.2  0.2  0.2
>>>>>
>>>>>            ss4 at assays[[2]]$unmeth[1:4,1:__4]
>>>>>
>>>>>
>>>>>               [,1] [,2] [,3] [,4]
>>>>>        [1,]  0.2  0.2  0.2  0.2
>>>>>        [2,]  0.2  0.2  0.2  0.2
>>>>>        [3,]  0.2  0.2  0.2  0.2
>>>>>        [4,]  0.2  0.2  0.2  0.2
>>>>>
>>>>>            unix.time(colData(ss4)$tx2 <- 1:1000)
>>>>>
>>>>>             user  system elapsed
>>>>>            0.010   0.002   0.012
>>>>>
>>>>>        where ss4 has a SimpleList of environments holding the assay
>>>>> data.
>>>>>
>>>>>            ss4
>>>>>
>>>>>        class: SummarizedExperiment
>>>>>        dim: 450000 1000
>>>>>        exptData(0):
>>>>>        assays(2): '' ''
>>>>>        rownames: NULL
>>>>>        rowData values names(0):
>>>>>        colnames: NULL
>>>>>        colData names(4): sampleID treatment tx tx2
>>>>>
>>>>>            ss4 at assays[[1]]
>>>>>
>>>>>        <environment: 0x1f4e0ba0>
>>>>>
>>>>>        This can't be attempted without disabling validity checking, and
>>>>>        the stack
>>>>>        trace from the attempt is instructive.
>>>>>
>>>>>        I don't think it will be too difficult to establish a customized
>>>>>        SummarizedExperiment that permits this, using the
>>>>> lockedEnvironment
>>>>>        approach of ExpressionSets.  I also do not know the down sides
>>>>>        but it seems
>>>>>        you/we could get some mileage on the approach and see.
>>>>>
>>>>>
>>>>>    GenomicRanges v. 1.9.59 has some preliminary changes that address
>>>>>    the performance issue. Serialized (saved) objects 'x' need to be
>>>>>    updated with updateObject(x) (and be careful if these are your only
>>>>>    instances of some hard work). There are some loose ends to tidy up.
>>>>>
>>>>>    Martin
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>        On Thu, Aug 30, 2012 at 10:53 AM, Kasper Daniel Hansen <
>>>>>        kasperdanielhansen at gmail.com
>>>>>        <mailto:kasperdanielhansen at gmail.com>> wrote:
>>>>>
>>>>>            My perspective (and the reason for me email) is pretty
>>>>> simple.
>>>>>
>>>>>            First off, let me say that we will always be working on
>>>>>            trading off
>>>>>            memory and IO and development time and all the other stuff.
>>>>>              I mean,
>>>>>            clearly the environment stuff for ExpressionSets did not
>>>>>            help all
>>>>>            people, because we have xps, aroma.affymetrix and oligo all
>>>>>            using some
>>>>>            kind of file based backend.
>>>>>
>>>>>            I am not trying to solve something "big".  Putting
>>>>>            methylation data
>>>>>            inside an environment is not suddenly going to make my
>>>>>            algorithms go
>>>>>            _significantly_ faster or the memory usage to go
>>>>>            _significantly_ down.
>>>>>               But it makes life faster when you explore data - I
>>>>> believe.
>>>>>
>>>>>            Now for two realistic examples (one minfi, one bsseq) that I
>>>>>            encounter
>>>>>            all the time:
>>>>>            library(GenomicRanges)
>>>>>            nRows <- 450000
>>>>>            nCols <- 1000
>>>>>            SSet <- SummarizedExperiment(rowData = GRanges(seqnames =
>>>>>            rep("chr1",
>>>>>            nRows),
>>>>>                                           ranges = IRanges(1:nRows,
>>>>>            width = 1)),
>>>>>                                           colData = DataFrame(sampleID =
>>>>>            paste0("person", 1:nCols)),
>>>>>                                           exptData = SimpleList(meth =
>>>>>            matrix(0.2,
>>>>>            ncol = nCols, nrow = nRows),
>>>>>                                           unmeth = matrix(0.2, ncol =
>>>>>            nCols, nrow =
>>>>>            nRows)))
>>>>>            print(object.size(SSet), units = "auto")  # 6.7GB
>>>>>            system.time({
>>>>>                  colData(SSet)$treatment = rep(c("a", "b"), nCols / 2)
>>>>>            }) # around 26 sec
>>>>>
>>>>>
>>>>>            nRows <- 28000000
>>>>>            nCols <- 10
>>>>>            SSet <- SummarizedExperiment(rowData = GRanges(seqnames =
>>>>>            rep("chr1",
>>>>>            nRows),
>>>>>                                           ranges = IRanges(1:nRows,
>>>>>            width = 1)),
>>>>>                                           colData = DataFrame(sampleID =
>>>>>            paste0("person", 1:nCols)),
>>>>>                                           exptData = SimpleList(meth =
>>>>>            matrix(0.2,
>>>>>            ncol = nCols, nrow = nRows),
>>>>>                                           unmeth = matrix(0.2, ncol =
>>>>>            nCols, nrow =
>>>>>            nRows)))
>>>>>            print(object.size(SSet), units = "auto")  # 4.4GB
>>>>>            system.time({
>>>>>                  colData(SSet)$treatment = rep(c("a", "b"), nCols / 2)
>>>>>            }) # around 20 sec
>>>>>
>>>>>            This is of course not that slow, especially not when
>>>>>            compared to the
>>>>>            massive amount of computation I had to do in order to arrive
>>>>>            at the
>>>>>            SSet in the first place.  But the difference is that I
>>>>>            staring at the
>>>>>            screen when I work interactively, but I am out partying when
>>>>>            I run my
>>>>>            computations (overnight).  And I do stuff like the above
>>>>>            surprisingly
>>>>>            many times after I have generated the objects.
>>>>>
>>>>>            My perspective is that this was essentially "solved" with the
>>>>>            ExpressionSet class, and I feel we are regressing a bit.  I
>>>>> also
>>>>>            believe that it should be "easy" to re-use some of the
>>>>>            ideas/code from
>>>>>            ExpressionSet and that it is worthwhile to do so, for
>>>>>            something that
>>>>>            may very well be an extremely important core class.
>>>>>            SummarizedExperiment is such a nice class that I essentially
>>>>>            have a
>>>>>            parallel implementation in bsseq and Pete has one in genoset.
>>>>>
>>>>>            So essentially, I was interested in fixing a pretty small
>>>>>            problem, but
>>>>>            something that I am a bit irritated about.  And I feel we
>>>>>            almost have
>>>>>            a ready to use solution.
>>>>>
>>>>>            The reason I ask about it here, and not just do it myself,
>>>>>            is twofold
>>>>>            (1) it is tricky to get the environment right, and I do not
>>>>> have
>>>>>            enough experience on the environment part of ExpressionSet,
>>>>>            that I
>>>>>            feel comfortable doing it
>>>>>            (2) I really think it would be beneficial to the community,
>>>>> and
>>>>>            experience tells us, that what happens in these core
>>>>>            infrastructure
>>>>>            packages are much more likely to be widely used.
>>>>>
>>>>>            While I think that it may be worthwhile for people to play
>>>>> with
>>>>>            reference classes, I think that going down that route is
>>>>>            unexplored
>>>>>            and may or may not be worth it.  There is something to be
>>>>>            said for
>>>>>            building on something we have experience with.  A reference
>>>>>            class idea
>>>>>            can be developed in parallel, if people are willing to
>>>>>            commit the
>>>>>            time.
>>>>>
>>>>>            Kasper
>>>>>
>>>>>
>>>>>            On Thu, Aug 30, 2012 at 10:12 AM, Tim Triche, Jr.
>>>>>            <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
>>>>>
>>>>>            wrote:
>>>>>
>>>>>                   I am interested in how best to benchmark such
>>>>>                operations (and another
>>>>>                thought that I recently had, "why can't I put a Doug
>>>>>                Bates style 'Matrix'
>>>>>                in the SimpleList of assay data)?  Kasper is the person
>>>>>                who convinced me
>>>>>                that ff, rhdf5, etc.should be a last resort, since RAM
>>>>>                is cheap.
>>>>>
>>>>>                For large data, the TCGA BRCA samples are a hair under
>>>>>                800 (last I
>>>>>
>>>>>            looked)
>>>>>
>>>>>                450k arrays with matching RNAseq and DNAseq, and that's
>>>>>                not even a big
>>>>>                study by breast cancer standards (IIRC, Gordon Mills was
>>>>>                rounding up 5000
>>>>>                tumors) Both minfi and methylumi can work off of the raw
>>>>>                data in these
>>>>>                studies as a "practical" example, but when it really
>>>>>                gets interesting is
>>>>>                when 3-4 ragged assays exist for each patient.
>>>>>
>>>>>                Exon-wise (or splice-graph-wise) RNAseq data with splice
>>>>>                junction read
>>>>>                counts is another place to find big SE-centric data with
>>>>>                ambiguities.
>>>>>
>>>>>            One
>>>>>
>>>>>                of the reasons I keep coming back to the idea of
>>>>>                subsetting an SE by a
>>>>>                GRanges[List], e.g. SE.totalRNA[ UCSC.lincRNAs, ], is
>>>>>                its speed. In any
>>>>>                event, after my obligations for the day are dispatched,
>>>>>                I'll write up an
>>>>>                example of doing this off of TCGA data, perhaps from
>>>>>                colorectal (only
>>>>>
>>>>>            about
>>>>>
>>>>>                400 subjects but at least it is published already).  You
>>>>>                are right, a
>>>>>                single use case is often worth 1000 specifications.
>>>>>
>>>>>                --t
>>>>>
>>>>>
>>>>>
>>>>>                On Thu, Aug 30, 2012 at 6:42 AM, Vincent Carey
>>>>>                <stvjc at channing.harvard.edu
>>>>>                <mailto:stvjc at channing.harvard.edu>>__wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                    On Thu, Aug 30, 2012 at 9:16 AM, Sean Davis
>>>>>                    <sdavis2 at mail.nih.gov <mailto:sdavis2 at mail.nih.gov>>
>>>>>
>>>>>
>>>>>            wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                        On Thu, Aug 30, 2012 at 8:59 AM, Martin Morgan
>>>>>                        <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>>>>>
>>>>>
>>>>>                wrote:
>>>>>
>>>>>
>>>>>                            On 08/30/2012 04:42 AM, Vincent Carey wrote:
>>>>>
>>>>>                                On Thu, Aug 30, 2012 at 6:27 AM, Tim
>>>>>                                Triche, Jr. <
>>>>>
>>>>>            tim.triche at gmail.com <mailto:tim.triche at gmail.com>
>>>>>
>>>>>
>>>>>                                    wrote:
>>>>>
>>>>>
>>>>>                                   nb.  one of the reasons for the
>>>>>                                existence of the MergedDataSet
>>>>>
>>>>>            class in
>>>>>
>>>>>                                    regulatoR (to be submitted for
>>>>>                                    review shortly) is that, while SEs
>>>>>
>>>>>            are
>>>>>
>>>>>                                    absolutely fantastic for managing
>>>>>                                    data matrices that are stapled to
>>>>>
>>>>>            a
>>>>>
>>>>>                                    GRanges, what is less awesome is
>>>>>                                    having a relatively light-weight
>>>>>                                    DataFrame
>>>>>                                    for phenotypic data that requires
>>>>>                                    the entire memory footprint be
>>>>>                                    recreated
>>>>>                                    upon writing a new column into said
>>>>>                                    DataFrame.
>>>>>
>>>>>                                    If R5 classes didn't spook me a
>>>>>                                    little, I would already have done
>>>>>                                    something
>>>>>
>>>>>
>>>>>                                We don't often use this R5 terminology
>>>>>                                but I see Hadley has made an
>>>>>                                accessible document referring to
>>>>>                                reference classes in this way.
>>>>>
>>>>>
>>>>>                            To me the challenge is more conceptual --
>>>>>                            pass-by-reference and the
>>>>>
>>>>>            way
>>>>>
>>>>>                            that two variables pointing to the instance
>>>>>                            are updated at the same
>>>>>
>>>>>            time --
>>>>>
>>>>>                            and I had been thinking of a
>>>>>                            LockedEnvironment-style implementation
>>>>>
>>>>>            where
>>>>>
>>>>>                            some operations were free ('copying') but
>>>>>                            others weren't (subset,
>>>>>
>>>>>            subset
>>>>>
>>>>>                            assign). But maybe there are some more
>>>>>                            direct approaches...
>>>>>
>>>>>
>>>>>                               My 2c: This is a situation where some
>>>>>                            experimental data would be
>>>>>
>>>>>                                helpful.
>>>>>
>>>>>
>>>>>
>>>>>                        Perhaps Tim or Kasper can share a largish
>>>>>                        (unpublished) dataset and a
>>>>>                        typical workflow with the Seattle folks.  Even
>>>>>                        that level of detail
>>>>>
>>>>>            would
>>>>>
>>>>>                        give some sense of the scope of the problem.
>>>>>
>>>>>
>>>>>
>>>>>                    For clarification, I did not mean large data,
>>>>>                    although that would be
>>>>>                    welcome.  I meant data on computational experiments
>>>>>                    with different
>>>>>                    approaches -- that is, performance statistics that
>>>>>                    would indicate where
>>>>>                    scalability fails, what steps have been taken to
>>>>>                    recover it, and how
>>>>>
>>>>>            costly
>>>>>
>>>>>                    those steps are in terms of complexity, reliability,
>>>>>                    maintainability,
>>>>>
>>>>>            risk
>>>>>
>>>>>                    at the user/application end.
>>>>>
>>>>>
>>>>>
>>>>>                        Sean
>>>>>
>>>>>
>>>>>                            Yes, for instance where in the interactive
>>>>>                            use is time being spent? Is
>>>>>                            it copying the assays, or validity, or
>>>>>                            actually updating the row
>>>>>
>>>>>            data? Is
>>>>>
>>>>>                            500000 x 800 an appropriate scale to be
>>>>>                            thinking about?
>>>>>
>>>>>
>>>>>                                 The main avenues for a developer seem
>>>>>                            to be a) use environments or
>>>>>
>>>>>                                reference classes; there are some costs
>>>>>                                and we should understand
>>>>>
>>>>>            them,
>>>>>
>>>>>                                and
>>>>>                                b) use an out-of-memory approach like
>>>>>                                rhdf5 or ff.  Again there will
>>>>>
>>>>>            be
>>>>>
>>>>>                                some costs.  It should be relatively
>>>>>                                easy to experiment with these.
>>>>>
>>>>>               One
>>>>>
>>>>>                                thing I just learned about is
>>>>>                                setValidity2 and disableValidity
>>>>>
>>>>>            (defined
>>>>>
>>>>>                                in
>>>>>                                IRanges IIRC) ... these allow you to
>>>>>                                construct certain variations on
>>>>>                                SummarizedExperiment with less attention
>>>>>                                to deeper infrastructure.
>>>>>
>>>>>
>>>>>                            probably I can make better use of the
>>>>>                            insights the IRanges guys have
>>>>>
>>>>>            had
>>>>>
>>>>>                            in their careful development and application
>>>>>                            of validity methods,
>>>>>
>>>>>            though I
>>>>>
>>>>>                            feel a bit like these are 'attractive
>>>>>                            hazards' that tempt us to do
>>>>>
>>>>>            unsafe
>>>>>
>>>>>                            things and then pay the price later. This is
>>>>>                            likely the first
>>>>>
>>>>>            direction
>>>>>
>>>>>                            I'll explore.
>>>>>
>>>>>                            Exploring a little I already see that there
>>>>>                            are some pretty dumb
>>>>>
>>>>>            things
>>>>>
>>>>>                            being done in assignment.
>>>>>
>>>>>                            Martin
>>>>>
>>>>>
>>>>>                               whereby the assays/libraries for a given
>>>>>                            study subject are all
>>>>>
>>>>>            pointed
>>>>>
>>>>>                                    to
>>>>>                                    as SEs (i.e. RNAseq, BSseq,
>>>>>                                    expression/methylation arrays,
>>>>> CNV/SNP
>>>>>                                    arrays,
>>>>>                                    WGS or exomic DNAseq) and the column
>>>>>                                    (phenotype) data can avoid
>>>>>
>>>>>            being
>>>>>
>>>>>                                    subject to these constraints.  Truth
>>>>>                                    be told I *still* want to do
>>>>>
>>>>>            that
>>>>>
>>>>>                                    because, most of the time, updates
>>>>>                                    to the latter are independent of,
>>>>>                                    and
>>>>>                                    happen subsequently to, loading the
>>>>>                                    former.
>>>>>
>>>>>                                    Suggestions would be welcome,
>>>>>                                    because other than these minor
>>>>>
>>>>>            niggles,
>>>>>
>>>>>                                    the
>>>>>                                    SummarizedExperiment class is almost
>>>>>                                    perfect for many tasks.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                                    On Wed, Aug 29, 2012 at 9:57 PM, Tim
>>>>>                                    Triche, Jr. <
>>>>>
>>>>>            tim.triche at gmail.com <mailto:tim.triche at gmail.com>
>>>>>
>>>>>
>>>>>
>>>>>                                        wrote:
>>>>>
>>>>>
>>>>>                                       assigning new colData columns, or
>>>>>                                    overwriting old ones, in a
>>>>>
>>>>>            sizable
>>>>>
>>>>>                                        (say
>>>>>                                        500000 row x 800 column) SE is
>>>>>                                        nauseatingly slow.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                                        There has to be a better way --
>>>>>                                        I'm willing to write it if
>>>>> someone
>>>>>
>>>>>            can
>>>>>
>>>>>                                        point out an obvious way to do it
>>>>>
>>>>>
>>>>>
>>>>>                                        On Wed, Aug 29, 2012 at 9:52 PM,
>>>>>                                        Kasper Daniel Hansen <
>>>>>                                        kasperdanielhansen at gmail.com
>>>>>
>>>>> <mailto:kasperdanielhansen at gmail.com>>
>>>>>
>>>>>                                        wrote:
>>>>>
>>>>>                                           On Thu, Aug 30, 2012 at 12:44
>>>>>                                        AM, Martin Morgan <
>>>>>
>>>>>            mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
>>>>>
>>>>>
>>>>>                                            wrote:
>>>>>
>>>>>                                                On 08/29/2012 06:46 PM,
>>>>>                                                Kasper Daniel Hansen
>>>>> wrote:
>>>>>
>>>>>
>>>>>                                                    There is a lot of
>>>>>                                                    good stuff to say
>>>>>                                                    about
>>>>>
>>>>> SummarizedExperiments,
>>>>>                                                    and
>>>>>                                                    from a certain point
>>>>>                                                    of view I have a
>>>>>                                                    parallel
>>>>>                                                    implementation in
>>>>>
>>>>>                                                bsseq
>>>>>
>>>>>
>>>>>                                           (and there is also one in
>>>>>                                        genoset).
>>>>>
>>>>>
>>>>>                                                    However, I really
>>>>>                                                    like having the
>>>>>                                                    assayData inside an
>>>>>
>>>>>            environment.
>>>>>
>>>>>                                                    This helps some on
>>>>>                                                    memory and - equally
>>>>>                                                    important - speed at
>>>>> the
>>>>>                                                    command line.  I
>>>>>                                                    certainly need to
>>>>>                                                    very heavily
>>>>>                                                    consider using
>>>>>
>>>>>            an
>>>>>
>>>>>                                                    environment in bsseq.
>>>>>
>>>>>                                                    After some
>>>>>                                                    discussion with Tim
>>>>>                                                    (Triche) we have
>>>>>                                                    agreed that
>>>>>                                                    something
>>>>>                                                    like
>>>>>                                                    SummarizedExperiments
>>>>> is
>>>>>                                                    the way to go at
>>>>>                                                    least for the
>>>>>                                                    methylation arrays.
>>>>>                                                      We need to be able
>>>>>                                                    to easily handle
>>>>> 1000s
>>>>>
>>>>>            of
>>>>>
>>>>>                                                    samples.
>>>>>
>>>>>                                                    What is the chance
>>>>>                                                    that we can get the
>>>>>                                                    option of having the
>>>>>                                                    assayData
>>>>>                                                    inside an
>>>>>                                                    environment, perhaps
>>>>> by
>>>>>                                                          Making a class
>>>>>                                                    that is an
>>>>>                                                    environment and
>>>>>                                                    inherits from
>>>>>
>>>>>                                                SimpleList.
>>>>>
>>>>>
>>>>>                                               Using a classUnion
>>>>>                                        between the existing class of
>>>>>                                        the assayData
>>>>>
>>>>>                                                    and
>>>>>                                                    an environment.
>>>>>                                                          Third option
>>>>>                                                    that is probably
>>>>>                                                    better than the
>>>>>                                                    proceeding
>>>>>
>>>>>            two,
>>>>>
>>>>>                                                    but
>>>>>                                                    which I cannot come
>>>>>                                                    up with right now.
>>>>>
>>>>>
>>>>>
>>>>>                                                Probably something can /
>>>>>                                                will be done. I guess
>>>>>                                                the slowness
>>>>>
>>>>>            you're
>>>>>
>>>>>
>>>>>                                            talking
>>>>>
>>>>>                                                about is when rowData /
>>>>>                                                colData columns are
>>>>>                                                manipulated; any
>>>>>
>>>>>            kind of
>>>>>
>>>>>                                                subsetting would mean a
>>>>>                                                'deep' copy. Martin
>>>>>
>>>>>
>>>>>                                            Yes, for example
>>>>>                                            manipulating colData -
>>>>>                                            something that
>>>>>
>>>>>            conceptually
>>>>>
>>>>>                                            should be quick and easy.
>>>>>                                              Of course, this will not
>>>>>                                            affect any
>>>>>
>>>>>            real
>>>>>
>>>>>                                            computation on the assayData
>>>>>                                            matrices, but it will make
>>>>>                                            life at
>>>>>
>>>>>            the
>>>>>
>>>>>                                            command prompt more pleasant.
>>>>>
>>>>>                                            Kasper
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                                                    This would - in my
>>>>>                                                    opinion - be very
>>>>>                                                    nice and worthwhile.
>>>>>
>>>>>                                                    Kasper
>>>>>
>>>>>
>>>>> ________________________________**_________________
>>>>>
>>>>> Bioc-devel at r-project.org
>>>>>
>>>>> <mailto:Bioc-devel at r-project.org>
>>>>>                                                    mailing list
>>>>>
>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>
>>>>>            https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>>            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                                                --
>>>>>                                                Computational Biology /
>>>>>                                                Fred Hutchinson Cancer
>>>>>                                                Research Center
>>>>>                                                1100 Fairview Ave. N.
>>>>>                                                PO Box 19024 Seattle, WA
>>>>>                                                98109
>>>>>
>>>>>                                                Location: Arnold
>>>>>                                                Building M1 B861
>>>>>                                                Phone: (206) 667-2793
>>>>>
>>>>> <tel:%28206%29%20667-2793>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________**_________________
>>>>>                                            Bioc-devel at r-project.org
>>>>>
>>>>> <mailto:Bioc-devel at r-project.org>
>>>>>                                            mailing list
>>>>>
>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>
>>>>>            https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>>            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                                        --
>>>>>                                        *A model is a lie that helps you
>>>>>                                        see the truth.*
>>>>>                                        *
>>>>>                                        *
>>>>>                                        Howard Skipper<
>>>>>
>>>>>
>>>>> http://cancerres.aacrjournals.__**org/content/31/9/1173.full.__pdf<
>>>>>
>>>>>
>>>>> http://cancerres.aacrjournals.__org/content/31/9/1173.full.pdf
>>>>>
>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>__>
>>>>>
>>>>>
>>>>>                                    **>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                                    --
>>>>>                                    *A model is a lie that helps you see
>>>>>                                    the truth.*
>>>>>                                    *
>>>>>                                    *
>>>>>                                    Howard Skipper<
>>>>>
>>>>> http://cancerres.aacrjournals.__**org/content/31/9/1173.full.__pdf<
>>>>>
>>>>>
>>>>> http://cancerres.aacrjournals.__org/content/31/9/1173.full.pdf
>>>>>
>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>__>
>>>>>
>>>>>
>>>>>                                    **>
>>>>>
>>>>>                                               [[alternative HTML
>>>>>                                    version deleted]]
>>>>>
>>>>>
>>>>> ________________________________**_________________
>>>>>                                    Bioc-devel at r-project.org
>>>>>                                    <mailto:Bioc-devel at r-project.org>
>>>>>                                    mailing list
>>>>>
>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>
>>>>>            https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>>            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>
>>>>>
>>>>>
>>>>>                                          [[alternative HTML version
>>>>>                                deleted]]
>>>>>
>>>>>
>>>>> ________________________________**_________________
>>>>>                                Bioc-devel at r-project.org
>>>>>                                <mailto:Bioc-devel at r-project.org>
>>>>>                                mailing list
>>>>>
>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>
>>>>>            https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>>            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                            --
>>>>>                            Computational Biology / Fred Hutchinson
>>>>>                            Cancer Research Center
>>>>>                            1100 Fairview Ave. N.
>>>>>                            PO Box 19024 Seattle, WA 98109
>>>>>
>>>>>                            Location: Arnold Building M1 B861
>>>>>                            Phone: (206) 667-2793
>>>>> <tel:%28206%29%20667-2793>
>>>>>
>>>>>
>>>>> ________________________________**_________________
>>>>>                            Bioc-devel at r-project.org
>>>>>                            <mailto:Bioc-devel at r-project.org> mailing
>>>>> list
>>>>>
>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>
>>>>>            https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>>            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                --
>>>>>                *A model is a lie that helps you see the truth.*
>>>>>                *
>>>>>                *
>>>>>                Howard Skipper<
>>>>>
>>>>>
>>>>> http://cancerres.aacrjournals.__org/content/31/9/1173.full.pdf
>>>>>
>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>__>
>>>>>
>>>>>
>>>>>                          [[alternative HTML version deleted]]
>>>>>
>>>>>                _________________________________________________
>>>>>                Bioc-devel at r-project.org
>>>>>                <mailto:Bioc-devel at r-project.org> mailing list
>>>>>                https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>>                <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>
>>>>>
>>>>>
>>>>>                 [[alternative HTML version deleted]]
>>>>>
>>>>>        _________________________________________________
>>>>>        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>>>        mailing list
>>>>>        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>>        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>
>>>>>
>>>>>
>>>>>    --
>>>>>    Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>    1100 Fairview Ave. N.
>>>>>    PO Box 19024 Seattle, WA 98109
>>>>>
>>>>>    Location: Arnold Building M1 B861
>>>>>    Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> /A model is a lie that helps you see the truth./
>>>>> /
>>>>> /
>>>>> Howard Skipper
>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>>>>
>>>>
>>>>
>>>> --
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>
>>>
>>>
>>>
>>> --
>>> A model is a lie that helps you see the truth.
>>>
>>> Howard Skipper
>>>



More information about the Bioc-devel mailing list