[Bioc-devel] SummarizedExperiments

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Wed Sep 12 21:15:26 CEST 2012


On Wed, Sep 12, 2012 at 2:23 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 09/06/2012 02:26 PM, Kasper Daniel Hansen wrote:
>>
>> Alltogether, this looks great.  I am still implementing/testing, but
>> seems pretty nice.
>>
>> Here is a number of convenience methods I believe should be
>> implemented.  Some of them comes from eSet and some of them comes from
>> my own work in bsseq, but I have used all of them extensively in
>> various packages.
>>
>> sampleNames, featureNames, pData
>> granges, start, end, width, strand, seqnames, seqlengths, seqlevels
>> findOverlaps, subsetByOverlaps
>
>
> GenomicRanges 1.9.65 in Bioc devel makes the following operations on
> SummarizedExperiment work on the underlying rowData; I particularly like
> subsetByOverslaps() for selecting, e.g., SummarizedExperiment rows that
> overlap features of interest, and seqlevels(x, force=TRUE) = paste("chr",
> 1:5), for instance, for selecting sequences of interest. See the section
> 'GRanges compatibility (rowData access)' on ?SummarizedExperiment.

Nice list, this is useful.  Now I guess I need to update bsseq.

seqselect is nice, but also destructive.  In my own work I have found
it quite useful to be able to subset on a whole chromosome, without
destroying the old object, like
  function(sset, "chr1")
I know this can be done by a long subsetByOverlaps, but - like
seqselect - it is nice to have a quicker way of doing it.

I don't see any mention of
  sset[gr,]
Is this planned or was it decided to drop this?

One thing I have in my package that I find indispensable is combine
and (my own) combineList.  The later for combining > 2 objects, which
has a lot of possibilities for speed up especially if (very common)
all the objects have the same rowData, as opposed to Reduce(combine,
LIST)..  Usecase: you need to add additional samples to your
SummarizedExperiment.

My code (which may utilize that width(object) == 1 is in bsseq/R/BSseq-class.R

Kasper



>
> Martin
>
> compare, countOverlaps, coverage, disjointBins, distance, distanceToNearest,
> duplicated, end, end<-, findOverlaps, flank, follow, granges, isDisjoint,
> match, mcols, mcols<-, narrow, nearest, order, precede, ranges, ranges<-,
> rank, resize, restrict, seqinfo, seqinfo<-, seqnames, shift, sort, start,
> start<-, strand, strand<-, width, width<-.
>
>
>>
>> 'granges' is a method already defined in GenomicRanges, I have found
>> it very convenient.  It is not clear to me why it is all lowercase
>>
>> What I have done in bsseq for the later two lines of methods, is to
>> have a class 'hasGRanges' that designate any object with a GRanges.
>> Then implement the granges() method and the rest goes from there.  Not
>> sure whether that makes sense in this context, but in general I have
>> found it very convenient for my own work.  Code for all of this is in
>> bsseq/R/hasGRanges.R.  I have two classes in the package inheriting
>> from hasGRanges.  Note that the classes in bsseq have not been updated
>> to build on the new SummarizedExperiment (yet).
>>
>> Kasper
>>
>>
>> On Wed, Sep 5, 2012 at 1:31 PM, Tim Triche, Jr. <tim.triche at gmail.com>
>> wrote:
>>>
>>> It seems to work with a Matrix in the assays:
>>>
>>> R> packageVersion('GenomicRanges')
>>> [1] ‘1.9.61’
>>>
>>> R> SE <- SummarizedExperiment(assays=wts, rowData=wts.locs)
>>> Error in function (classes, fdef, mtable)  :
>>>    unable to find an inherited method for function
>>> "SummarizedExperiment",
>>> for signature "dgCMatrix"
>>>
>>> R> SE <- SummarizedExperiment(assays=SimpleList(weights=wts),
>>> rowData=wts.locs)
>>> dimnames(.) <- NULL:  translated to
>>> dimnames(.) <- list(NULL,NULL)  <==>  unname(.)
>>>
>>> R> show(SE)
>>> class: SummarizedExperiment
>>> dim: 35855 35855
>>> exptData(0):
>>> assays(1): weights
>>> rownames(35855): feat1 feat2 ... feat35854 feat35855
>>> rowData metadata column names(10): egid signed ... DHS_DGF chromHMM_state
>>> colnames(35855): feat1 feat2 ... feat35854 feat35855
>>> colData names(0):
>>>
>>> R> class(assays(SE, 'weights'))
>>> [1] "SimpleList"
>>> attr(,"package")
>>> [1] "IRanges"
>>>
>>> R> class(assays(SE)$weights)
>>> [1] "dgCMatrix"
>>> attr(,"package")
>>> [1] "Matrix"
>>>
>>>
>>> Very cool!  Albeit a tad disorienting.  I was going to patch this up and
>>> then realized, hey, if someone's going to stuff a Matrix into their
>>> assays,
>>> they ought to give it a name and know what's going on under the hood, at
>>> least for now.  It seems to work correctly, too.
>>>
>>> Oh oh, I spoke too soon.  If I split by chromosome, the assay falls out:
>>>
>>> R> head(assays(SE)$weights)
>>> 6 x 35855 sparse Matrix of class "dgCMatrix"
>>>     [[ suppressing 82 column names ‘feat1’, ‘feat2’, ‘feat3’ ... ]]
>>>
>>> feat1 1 . .         .         .          .          . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . .
>>> feat2 . 1 .         .         .          .          . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . .
>>> feat3 . . 1.0000000 0.5737488 .          .          . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . .
>>> feat4 . . 0.5737488 1.0000000 .          .          . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . .
>>> feat5 . . .         .         1.00000000 0.04300373 . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . .
>>> feat6 . . .         .         0.04300373 1.00000000 . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . .
>>>
>>> feat1 . . . . . . . . . . . . . . ......
>>> feat2 . . . . . . . . . . . . . . ......
>>> feat3 . . . . . . . . . . . . . . ......
>>> feat4 . . . . . . . . . . . . . . ......
>>> feat5 . . . . . . . . . . . . . . ......
>>> feat6 . . . . . . . . . . . . . . ......
>>>
>>>   .....suppressing columns in show(); maybe adjust 'options(max.print=
>>> *)'
>>>   ..............................
>>> R> SE.by.chr <- split(SE, as.vector(seqnames(SE)))
>>> R> head(assays(SE.by.chr)$weights)
>>> Error in head(assays(SE.by.chr)$weights) :
>>>    error in evaluating the argument 'x' in selecting a method for
>>> function
>>> 'head': Error in function (classes, fdef, mtable)  :
>>>    unable to find an inherited method for function "assays", for
>>> signature
>>> "list"
>>>
>>> Ideas?  I would love to do something useful here instead of just fussing.
>>> It's really a great structure.
>>>
>>> I love the use of a reference class to reference the internal pieces of
>>> an
>>> object, it's such a clever hack.  Thanks much for adding it.
>>>
>>> --t
>>>
>>>
>>>
>>> On Tue, Sep 4, 2012 at 7:58 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>>>
>>>>
>>>> On 09/04/2012 07:02 PM, Tim Triche, Jr. wrote:
>>>>>
>>>>>
>>>>> This is really cool.
>>>>>
>>>>> R> head(lsos(), 1)
>>>>>                               Type     Size  Rows Columns
>>>>> LAML.RNAseq SummarizedExperiment 38666528 23529     173
>>>>> R> LAML.RNAseq.updated <- updateObject(LAML.RNAseq)
>>>>> R> head(lsos(), 2)
>>>>>                                       Type     Size  Rows Columns
>>>>> LAML.RNAseq         SummarizedExperiment 38666528 23529     173
>>>>> LAML.RNAseq.updated SummarizedExperiment  6101488 23529     173
>>>>
>>>>
>>>>
>>>> I think either the sizes are misleading (missing the content of the
>>>> environment that the reference classes contain, probably
>>>> LAML.RNAseq.updated at assays$data has size 38666528 - 6101488) or the data
>>>> has
>>>> not been updated correctly :\. Nonetheless manipulation of especially
>>>> non-assay but also assay data should be faster.
>>>>
>>>> I added a brief section (1.9.60) to ?SummarizedExperiment describing
>>>> what
>>>> was done. Also, I tried (w/out actually testing...) to add support for
>>>> Matrix instances.
>>>>
>>>> As a miscellaneous 'tip', when working with large data I find it useful
>>>> to
>>>> start R with --min-vsize=2048M --min-nsize=20M. These are documented in
>>>> RShowDoc("R-intro") as 'for expert use only' and influence how much
>>>> memory R
>>>> allocates for vectors ('vsize'; vsize can be very large) and for
>>>> S-expressions ('nsize' 50000000) and help get up to large memory
>>>> allocations
>>>> without innumerable garbage collections. Of course having big memory is
>>>> of
>>>> primary importance.
>>>>
>>>> Martin
>>>>
>>>>> R> str(LAML.RNAseq.updated at assays)
>>>>> Reference class 'ShallowSimpleListAssays' [package "GenomicRanges"]
>>>>> with
>>>>> 1 fields
>>>>>    $ data:Formal class 'SimpleList' [package "IRanges"] with 4 slots
>>>>>     .. ..@ listData       :List of 1
>>>>>     .. .. ..$ rpm: num [1:23529, 1:173] 0 0 8.12 10.84 27.73 ...
>>>>>     .. ..@ elementType    : chr "ANY"
>>>>>     .. ..@ elementMetadata: NULL
>>>>>     .. ..@ metadata       : list()
>>>>>    and 11 methods,R>
>>>>> R> packageVersion('GenomicRanges')
>>>>> [1] ‘1.9.59’
>>>>>
>>>>> A slightly bigger one:
>>>>>
>>>>> R> head(lsos(),2)
>>>>>                                Type       Size   Rows Columns
>>>>> LAML         SummarizedExperiment 1950606472 485577     192
>>>>> LAML.updated SummarizedExperiment  458912760 485577     192
>>>>> R> system.time(LAML.updated$foo <- rep('bar', 192))
>>>>>      user  system elapsed
>>>>>     0.132   0.116   0.248
>>>>> R> system.time(LAML$foo <- rep('bar', 192))
>>>>>      user  system elapsed
>>>>>     2.728   2.144   7.519
>>>>>
>>>>>
>>>>> That is a really clever hack you added.  I think I will have to figure
>>>>> out how it works :-)
>>>>>
>>>>> Thanks!!!
>>>>>
>>>>>
>>>>> On Sun, Sep 2, 2012 at 2:47 PM, Martin Morgan <mtmorgan at fhcrc.org
>>>>> <mailto:mtmorgan at fhcrc.org>> wrote:
>>>>>
>>>>>      On 08/30/2012 08:47 AM, Vincent Carey wrote:
>>>>>
>>>>>          I am all in favor of the dialogue and the aim.  I reproduced
>>>>>          your first
>>>>>          timing, and note
>>>>>
>>>>>              unix.time(ss2 at colData$tx <- 1:1000)
>>>>>
>>>>>               user  system elapsed
>>>>>              5.963   4.520  10.483
>>>>>
>>>>>              unix.time(colData(ss2)$tx2 <- 1:1000)
>>>>>
>>>>>               user  system elapsed
>>>>>             17.937  13.074  31.016
>>>>>
>>>>>          BTW did you really mean to put the meth/unmeth in exptData, or
>>>>>          should it be
>>>>>          in assays?
>>>>>
>>>>>               From an experimentation perspective, I would want to redo
>>>>> these
>>>>>
>>>>>          computations above with an environment holding the assay data
>>>>>          and see how
>>>>>          it changes.  I suppose one can infer from first principles
>>>>> that
>>>>>
>>>>>              ss4 at assays[[1]]$meth[1:4,1:4]
>>>>>
>>>>>                 [,1] [,2] [,3] [,4]
>>>>>          [1,]  0.2  0.2  0.2  0.2
>>>>>          [2,]  0.2  0.2  0.2  0.2
>>>>>          [3,]  0.2  0.2  0.2  0.2
>>>>>          [4,]  0.2  0.2  0.2  0.2
>>>>>
>>>>>              ss4 at assays[[2]]$unmeth[1:4,1:__4]
>>>>>
>>>>>
>>>>>                 [,1] [,2] [,3] [,4]
>>>>>          [1,]  0.2  0.2  0.2  0.2
>>>>>          [2,]  0.2  0.2  0.2  0.2
>>>>>          [3,]  0.2  0.2  0.2  0.2
>>>>>          [4,]  0.2  0.2  0.2  0.2
>>>>>
>>>>>              unix.time(colData(ss4)$tx2 <- 1:1000)
>>>>>
>>>>>               user  system elapsed
>>>>>              0.010   0.002   0.012
>>>>>
>>>>>          where ss4 has a SimpleList of environments holding the assay
>>>>> data.
>>>>>
>>>>>              ss4
>>>>>
>>>>>          class: SummarizedExperiment
>>>>>          dim: 450000 1000
>>>>>          exptData(0):
>>>>>          assays(2): '' ''
>>>>>          rownames: NULL
>>>>>          rowData values names(0):
>>>>>          colnames: NULL
>>>>>          colData names(4): sampleID treatment tx tx2
>>>>>
>>>>>              ss4 at assays[[1]]
>>>>>
>>>>>          <environment: 0x1f4e0ba0>
>>>>>
>>>>>          This can't be attempted without disabling validity checking,
>>>>> and
>>>>>          the stack
>>>>>          trace from the attempt is instructive.
>>>>>
>>>>>          I don't think it will be too difficult to establish a
>>>>> customized
>>>>>          SummarizedExperiment that permits this, using the
>>>>> lockedEnvironment
>>>>>          approach of ExpressionSets.  I also do not know the down sides
>>>>>          but it seems
>>>>>          you/we could get some mileage on the approach and see.
>>>>>
>>>>>
>>>>>      GenomicRanges v. 1.9.59 has some preliminary changes that address
>>>>>      the performance issue. Serialized (saved) objects 'x' need to be
>>>>>      updated with updateObject(x) (and be careful if these are your
>>>>> only
>>>>>      instances of some hard work). There are some loose ends to tidy
>>>>> up.
>>>>>
>>>>>      Martin
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>          On Thu, Aug 30, 2012 at 10:53 AM, Kasper Daniel Hansen <
>>>>>          kasperdanielhansen at gmail.com
>>>>>          <mailto:kasperdanielhansen at gmail.com>> wrote:
>>>>>
>>>>>              My perspective (and the reason for me email) is pretty
>>>>> simple.
>>>>>
>>>>>              First off, let me say that we will always be working on
>>>>>              trading off
>>>>>              memory and IO and development time and all the other
>>>>> stuff.
>>>>>                I mean,
>>>>>              clearly the environment stuff for ExpressionSets did not
>>>>>              help all
>>>>>              people, because we have xps, aroma.affymetrix and oligo
>>>>> all
>>>>>              using some
>>>>>              kind of file based backend.
>>>>>
>>>>>              I am not trying to solve something "big".  Putting
>>>>>              methylation data
>>>>>              inside an environment is not suddenly going to make my
>>>>>              algorithms go
>>>>>              _significantly_ faster or the memory usage to go
>>>>>              _significantly_ down.
>>>>>                 But it makes life faster when you explore data - I
>>>>> believe.
>>>>>
>>>>>              Now for two realistic examples (one minfi, one bsseq) that
>>>>> I
>>>>>              encounter
>>>>>              all the time:
>>>>>              library(GenomicRanges)
>>>>>              nRows <- 450000
>>>>>              nCols <- 1000
>>>>>              SSet <- SummarizedExperiment(rowData = GRanges(seqnames =
>>>>>              rep("chr1",
>>>>>              nRows),
>>>>>                                             ranges = IRanges(1:nRows,
>>>>>              width = 1)),
>>>>>                                             colData =
>>>>> DataFrame(sampleID =
>>>>>              paste0("person", 1:nCols)),
>>>>>                                             exptData = SimpleList(meth
>>>>> =
>>>>>              matrix(0.2,
>>>>>              ncol = nCols, nrow = nRows),
>>>>>                                             unmeth = matrix(0.2, ncol =
>>>>>              nCols, nrow =
>>>>>              nRows)))
>>>>>              print(object.size(SSet), units = "auto")  # 6.7GB
>>>>>              system.time({
>>>>>                    colData(SSet)$treatment = rep(c("a", "b"), nCols /
>>>>> 2)
>>>>>              }) # around 26 sec
>>>>>
>>>>>
>>>>>              nRows <- 28000000
>>>>>              nCols <- 10
>>>>>              SSet <- SummarizedExperiment(rowData = GRanges(seqnames =
>>>>>              rep("chr1",
>>>>>              nRows),
>>>>>                                             ranges = IRanges(1:nRows,
>>>>>              width = 1)),
>>>>>                                             colData =
>>>>> DataFrame(sampleID =
>>>>>              paste0("person", 1:nCols)),
>>>>>                                             exptData = SimpleList(meth
>>>>> =
>>>>>              matrix(0.2,
>>>>>              ncol = nCols, nrow = nRows),
>>>>>                                             unmeth = matrix(0.2, ncol =
>>>>>              nCols, nrow =
>>>>>              nRows)))
>>>>>              print(object.size(SSet), units = "auto")  # 4.4GB
>>>>>              system.time({
>>>>>                    colData(SSet)$treatment = rep(c("a", "b"), nCols /
>>>>> 2)
>>>>>              }) # around 20 sec
>>>>>
>>>>>              This is of course not that slow, especially not when
>>>>>              compared to the
>>>>>              massive amount of computation I had to do in order to
>>>>> arrive
>>>>>              at the
>>>>>              SSet in the first place.  But the difference is that I
>>>>>              staring at the
>>>>>              screen when I work interactively, but I am out partying
>>>>> when
>>>>>              I run my
>>>>>              computations (overnight).  And I do stuff like the above
>>>>>              surprisingly
>>>>>              many times after I have generated the objects.
>>>>>
>>>>>              My perspective is that this was essentially "solved" with
>>>>> the
>>>>>              ExpressionSet class, and I feel we are regressing a bit.
>>>>> I
>>>>> also
>>>>>              believe that it should be "easy" to re-use some of the
>>>>>              ideas/code from
>>>>>              ExpressionSet and that it is worthwhile to do so, for
>>>>>              something that
>>>>>              may very well be an extremely important core class.
>>>>>              SummarizedExperiment is such a nice class that I
>>>>> essentially
>>>>>              have a
>>>>>              parallel implementation in bsseq and Pete has one in
>>>>> genoset.
>>>>>
>>>>>              So essentially, I was interested in fixing a pretty small
>>>>>              problem, but
>>>>>              something that I am a bit irritated about.  And I feel we
>>>>>              almost have
>>>>>              a ready to use solution.
>>>>>
>>>>>              The reason I ask about it here, and not just do it myself,
>>>>>              is twofold
>>>>>              (1) it is tricky to get the environment right, and I do
>>>>> not
>>>>> have
>>>>>              enough experience on the environment part of
>>>>> ExpressionSet,
>>>>>              that I
>>>>>              feel comfortable doing it
>>>>>              (2) I really think it would be beneficial to the
>>>>> community,
>>>>> and
>>>>>              experience tells us, that what happens in these core
>>>>>              infrastructure
>>>>>              packages are much more likely to be widely used.
>>>>>
>>>>>              While I think that it may be worthwhile for people to play
>>>>> with
>>>>>              reference classes, I think that going down that route is
>>>>>              unexplored
>>>>>              and may or may not be worth it.  There is something to be
>>>>>              said for
>>>>>              building on something we have experience with.  A
>>>>> reference
>>>>>              class idea
>>>>>              can be developed in parallel, if people are willing to
>>>>>              commit the
>>>>>              time.
>>>>>
>>>>>              Kasper
>>>>>
>>>>>
>>>>>              On Thu, Aug 30, 2012 at 10:12 AM, Tim Triche, Jr.
>>>>>              <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
>>>>>
>>>>>              wrote:
>>>>>
>>>>>                     I am interested in how best to benchmark such
>>>>>                  operations (and another
>>>>>                  thought that I recently had, "why can't I put a Doug
>>>>>                  Bates style 'Matrix'
>>>>>                  in the SimpleList of assay data)?  Kasper is the
>>>>> person
>>>>>                  who convinced me
>>>>>                  that ff, rhdf5, etc.should be a last resort, since RAM
>>>>>                  is cheap.
>>>>>
>>>>>                  For large data, the TCGA BRCA samples are a hair under
>>>>>                  800 (last I
>>>>>
>>>>>              looked)
>>>>>
>>>>>                  450k arrays with matching RNAseq and DNAseq, and
>>>>> that's
>>>>>                  not even a big
>>>>>                  study by breast cancer standards (IIRC, Gordon Mills
>>>>> was
>>>>>                  rounding up 5000
>>>>>                  tumors) Both minfi and methylumi can work off of the
>>>>> raw
>>>>>                  data in these
>>>>>                  studies as a "practical" example, but when it really
>>>>>                  gets interesting is
>>>>>                  when 3-4 ragged assays exist for each patient.
>>>>>
>>>>>                  Exon-wise (or splice-graph-wise) RNAseq data with
>>>>> splice
>>>>>                  junction read
>>>>>                  counts is another place to find big SE-centric data
>>>>> with
>>>>>                  ambiguities.
>>>>>
>>>>>              One
>>>>>
>>>>>                  of the reasons I keep coming back to the idea of
>>>>>                  subsetting an SE by a
>>>>>                  GRanges[List], e.g. SE.totalRNA[ UCSC.lincRNAs, ], is
>>>>>                  its speed. In any
>>>>>                  event, after my obligations for the day are
>>>>> dispatched,
>>>>>                  I'll write up an
>>>>>                  example of doing this off of TCGA data, perhaps from
>>>>>                  colorectal (only
>>>>>
>>>>>              about
>>>>>
>>>>>                  400 subjects but at least it is published already).
>>>>> You
>>>>>                  are right, a
>>>>>                  single use case is often worth 1000 specifications.
>>>>>
>>>>>                  --t
>>>>>
>>>>>
>>>>>
>>>>>                  On Thu, Aug 30, 2012 at 6:42 AM, Vincent Carey
>>>>>                  <stvjc at channing.harvard.edu
>>>>>                  <mailto:stvjc at channing.harvard.edu>>__wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                      On Thu, Aug 30, 2012 at 9:16 AM, Sean Davis
>>>>>                      <sdavis2 at mail.nih.gov
>>>>> <mailto:sdavis2 at mail.nih.gov>>
>>>>>
>>>>>
>>>>>              wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                          On Thu, Aug 30, 2012 at 8:59 AM, Martin Morgan
>>>>>                          <mtmorgan at fhcrc.org
>>>>> <mailto:mtmorgan at fhcrc.org>
>>>>>
>>>>>
>>>>>                  wrote:
>>>>>
>>>>>
>>>>>                              On 08/30/2012 04:42 AM, Vincent Carey
>>>>> wrote:
>>>>>
>>>>>                                  On Thu, Aug 30, 2012 at 6:27 AM, Tim
>>>>>                                  Triche, Jr. <
>>>>>
>>>>>              tim.triche at gmail.com <mailto:tim.triche at gmail.com>
>>>>>
>>>>>
>>>>>                                      wrote:
>>>>>
>>>>>
>>>>>                                     nb.  one of the reasons for the
>>>>>                                  existence of the MergedDataSet
>>>>>
>>>>>              class in
>>>>>
>>>>>                                      regulatoR (to be submitted for
>>>>>                                      review shortly) is that, while SEs
>>>>>
>>>>>              are
>>>>>
>>>>>                                      absolutely fantastic for managing
>>>>>                                      data matrices that are stapled to
>>>>>
>>>>>              a
>>>>>
>>>>>                                      GRanges, what is less awesome is
>>>>>                                      having a relatively light-weight
>>>>>                                      DataFrame
>>>>>                                      for phenotypic data that requires
>>>>>                                      the entire memory footprint be
>>>>>                                      recreated
>>>>>                                      upon writing a new column into
>>>>> said
>>>>>                                      DataFrame.
>>>>>
>>>>>                                      If R5 classes didn't spook me a
>>>>>                                      little, I would already have done
>>>>>                                      something
>>>>>
>>>>>
>>>>>                                  We don't often use this R5 terminology
>>>>>                                  but I see Hadley has made an
>>>>>                                  accessible document referring to
>>>>>                                  reference classes in this way.
>>>>>
>>>>>
>>>>>                              To me the challenge is more conceptual --
>>>>>                              pass-by-reference and the
>>>>>
>>>>>              way
>>>>>
>>>>>                              that two variables pointing to the
>>>>> instance
>>>>>                              are updated at the same
>>>>>
>>>>>              time --
>>>>>
>>>>>                              and I had been thinking of a
>>>>>                              LockedEnvironment-style implementation
>>>>>
>>>>>              where
>>>>>
>>>>>                              some operations were free ('copying') but
>>>>>                              others weren't (subset,
>>>>>
>>>>>              subset
>>>>>
>>>>>                              assign). But maybe there are some more
>>>>>                              direct approaches...
>>>>>
>>>>>
>>>>>                                 My 2c: This is a situation where some
>>>>>                              experimental data would be
>>>>>
>>>>>                                  helpful.
>>>>>
>>>>>
>>>>>
>>>>>                          Perhaps Tim or Kasper can share a largish
>>>>>                          (unpublished) dataset and a
>>>>>                          typical workflow with the Seattle folks.  Even
>>>>>                          that level of detail
>>>>>
>>>>>              would
>>>>>
>>>>>                          give some sense of the scope of the problem.
>>>>>
>>>>>
>>>>>
>>>>>                      For clarification, I did not mean large data,
>>>>>                      although that would be
>>>>>                      welcome.  I meant data on computational
>>>>> experiments
>>>>>                      with different
>>>>>                      approaches -- that is, performance statistics that
>>>>>                      would indicate where
>>>>>                      scalability fails, what steps have been taken to
>>>>>                      recover it, and how
>>>>>
>>>>>              costly
>>>>>
>>>>>                      those steps are in terms of complexity,
>>>>> reliability,
>>>>>                      maintainability,
>>>>>
>>>>>              risk
>>>>>
>>>>>                      at the user/application end.
>>>>>
>>>>>
>>>>>
>>>>>                          Sean
>>>>>
>>>>>
>>>>>                              Yes, for instance where in the interactive
>>>>>                              use is time being spent? Is
>>>>>                              it copying the assays, or validity, or
>>>>>                              actually updating the row
>>>>>
>>>>>              data? Is
>>>>>
>>>>>                              500000 x 800 an appropriate scale to be
>>>>>                              thinking about?
>>>>>
>>>>>
>>>>>                                   The main avenues for a developer seem
>>>>>                              to be a) use environments or
>>>>>
>>>>>                                  reference classes; there are some
>>>>> costs
>>>>>                                  and we should understand
>>>>>
>>>>>              them,
>>>>>
>>>>>                                  and
>>>>>                                  b) use an out-of-memory approach like
>>>>>                                  rhdf5 or ff.  Again there will
>>>>>
>>>>>              be
>>>>>
>>>>>                                  some costs.  It should be relatively
>>>>>                                  easy to experiment with these.
>>>>>
>>>>>                 One
>>>>>
>>>>>                                  thing I just learned about is
>>>>>                                  setValidity2 and disableValidity
>>>>>
>>>>>              (defined
>>>>>
>>>>>                                  in
>>>>>                                  IRanges IIRC) ... these allow you to
>>>>>                                  construct certain variations on
>>>>>                                  SummarizedExperiment with less
>>>>> attention
>>>>>                                  to deeper infrastructure.
>>>>>
>>>>>
>>>>>                              probably I can make better use of the
>>>>>                              insights the IRanges guys have
>>>>>
>>>>>              had
>>>>>
>>>>>                              in their careful development and
>>>>> application
>>>>>                              of validity methods,
>>>>>
>>>>>              though I
>>>>>
>>>>>                              feel a bit like these are 'attractive
>>>>>                              hazards' that tempt us to do
>>>>>
>>>>>              unsafe
>>>>>
>>>>>                              things and then pay the price later. This
>>>>> is
>>>>>                              likely the first
>>>>>
>>>>>              direction
>>>>>
>>>>>                              I'll explore.
>>>>>
>>>>>                              Exploring a little I already see that
>>>>> there
>>>>>                              are some pretty dumb
>>>>>
>>>>>              things
>>>>>
>>>>>                              being done in assignment.
>>>>>
>>>>>                              Martin
>>>>>
>>>>>
>>>>>                                 whereby the assays/libraries for a
>>>>> given
>>>>>                              study subject are all
>>>>>
>>>>>              pointed
>>>>>
>>>>>                                      to
>>>>>                                      as SEs (i.e. RNAseq, BSseq,
>>>>>                                      expression/methylation arrays,
>>>>> CNV/SNP
>>>>>                                      arrays,
>>>>>                                      WGS or exomic DNAseq) and the
>>>>> column
>>>>>                                      (phenotype) data can avoid
>>>>>
>>>>>              being
>>>>>
>>>>>                                      subject to these constraints.
>>>>> Truth
>>>>>                                      be told I *still* want to do
>>>>>
>>>>>              that
>>>>>
>>>>>                                      because, most of the time, updates
>>>>>                                      to the latter are independent of,
>>>>>                                      and
>>>>>                                      happen subsequently to, loading
>>>>> the
>>>>>                                      former.
>>>>>
>>>>>                                      Suggestions would be welcome,
>>>>>                                      because other than these minor
>>>>>
>>>>>              niggles,
>>>>>
>>>>>                                      the
>>>>>                                      SummarizedExperiment class is
>>>>> almost
>>>>>                                      perfect for many tasks.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                                      On Wed, Aug 29, 2012 at 9:57 PM,
>>>>> Tim
>>>>>                                      Triche, Jr. <
>>>>>
>>>>>              tim.triche at gmail.com <mailto:tim.triche at gmail.com>
>>>>>
>>>>>
>>>>>
>>>>>                                          wrote:
>>>>>
>>>>>
>>>>>                                         assigning new colData columns,
>>>>> or
>>>>>                                      overwriting old ones, in a
>>>>>
>>>>>              sizable
>>>>>
>>>>>                                          (say
>>>>>                                          500000 row x 800 column) SE is
>>>>>                                          nauseatingly slow.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                                          There has to be a better way
>>>>> --
>>>>>                                          I'm willing to write it if
>>>>> someone
>>>>>
>>>>>              can
>>>>>
>>>>>                                          point out an obvious way to do
>>>>> it
>>>>>
>>>>>
>>>>>
>>>>>                                          On Wed, Aug 29, 2012 at 9:52
>>>>> PM,
>>>>>                                          Kasper Daniel Hansen <
>>>>>                                          kasperdanielhansen at gmail.com
>>>>>
>>>>> <mailto:kasperdanielhansen at gmail.com>>
>>>>>
>>>>>                                          wrote:
>>>>>
>>>>>                                             On Thu, Aug 30, 2012 at
>>>>> 12:44
>>>>>                                          AM, Martin Morgan <
>>>>>
>>>>>              mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
>>>>>
>>>>>
>>>>>                                              wrote:
>>>>>
>>>>>                                                  On 08/29/2012 06:46
>>>>> PM,
>>>>>                                                  Kasper Daniel Hansen
>>>>> wrote:
>>>>>
>>>>>
>>>>>                                                      There is a lot of
>>>>>                                                      good stuff to say
>>>>>                                                      about
>>>>>
>>>>> SummarizedExperiments,
>>>>>                                                      and
>>>>>                                                      from a certain
>>>>> point
>>>>>                                                      of view I have a
>>>>>                                                      parallel
>>>>>                                                      implementation in
>>>>>
>>>>>                                                  bsseq
>>>>>
>>>>>
>>>>>                                             (and there is also one in
>>>>>                                          genoset).
>>>>>
>>>>>
>>>>>                                                      However, I really
>>>>>                                                      like having the
>>>>>                                                      assayData inside
>>>>> an
>>>>>
>>>>>              environment.
>>>>>
>>>>>                                                      This helps some on
>>>>>                                                      memory and -
>>>>> equally
>>>>>                                                      important - speed
>>>>> at
>>>>> the
>>>>>                                                      command line.  I
>>>>>                                                      certainly need to
>>>>>                                                      very heavily
>>>>>                                                      consider using
>>>>>
>>>>>              an
>>>>>
>>>>>                                                      environment in
>>>>> bsseq.
>>>>>
>>>>>                                                      After some
>>>>>                                                      discussion with
>>>>> Tim
>>>>>                                                      (Triche) we have
>>>>>                                                      agreed that
>>>>>                                                      something
>>>>>                                                      like
>>>>>
>>>>> SummarizedExperiments
>>>>> is
>>>>>                                                      the way to go at
>>>>>                                                      least for the
>>>>>                                                      methylation
>>>>> arrays.
>>>>>                                                        We need to be
>>>>> able
>>>>>                                                      to easily handle
>>>>> 1000s
>>>>>
>>>>>              of
>>>>>
>>>>>                                                      samples.
>>>>>
>>>>>                                                      What is the chance
>>>>>                                                      that we can get
>>>>> the
>>>>>                                                      option of having
>>>>> the
>>>>>                                                      assayData
>>>>>                                                      inside an
>>>>>                                                      environment,
>>>>> perhaps
>>>>> by
>>>>>                                                            Making a
>>>>> class
>>>>>                                                      that is an
>>>>>                                                      environment and
>>>>>                                                      inherits from
>>>>>
>>>>>                                                  SimpleList.
>>>>>
>>>>>
>>>>>                                                 Using a classUnion
>>>>>                                          between the existing class of
>>>>>                                          the assayData
>>>>>
>>>>>                                                      and
>>>>>                                                      an environment.
>>>>>                                                            Third option
>>>>>                                                      that is probably
>>>>>                                                      better than the
>>>>>                                                      proceeding
>>>>>
>>>>>              two,
>>>>>
>>>>>                                                      but
>>>>>                                                      which I cannot
>>>>> come
>>>>>                                                      up with right now.
>>>>>
>>>>>
>>>>>
>>>>>                                                  Probably something can
>>>>> /
>>>>>                                                  will be done. I guess
>>>>>                                                  the slowness
>>>>>
>>>>>              you're
>>>>>
>>>>>
>>>>>                                              talking
>>>>>
>>>>>                                                  about is when rowData
>>>>> /
>>>>>                                                  colData columns are
>>>>>                                                  manipulated; any
>>>>>
>>>>>              kind of
>>>>>
>>>>>                                                  subsetting would mean
>>>>> a
>>>>>                                                  'deep' copy. Martin
>>>>>
>>>>>
>>>>>                                              Yes, for example
>>>>>                                              manipulating colData -
>>>>>                                              something that
>>>>>
>>>>>              conceptually
>>>>>
>>>>>                                              should be quick and easy.
>>>>>                                                Of course, this will not
>>>>>                                              affect any
>>>>>
>>>>>              real
>>>>>
>>>>>                                              computation on the
>>>>> assayData
>>>>>                                              matrices, but it will make
>>>>>                                              life at
>>>>>
>>>>>              the
>>>>>
>>>>>                                              command prompt more
>>>>> pleasant.
>>>>>
>>>>>                                              Kasper
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                                                      This would - in my
>>>>>                                                      opinion - be very
>>>>>                                                      nice and
>>>>> worthwhile.
>>>>>
>>>>>                                                      Kasper
>>>>>
>>>>>
>>>>> ________________________________**_________________
>>>>>
>>>>> Bioc-devel at r-project.org
>>>>>
>>>>> <mailto:Bioc-devel at r-project.org>
>>>>>                                                      mailing list
>>>>>
>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>
>>>>>              https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>>              <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                                                  --
>>>>>                                                  Computational Biology
>>>>> /
>>>>>                                                  Fred Hutchinson Cancer
>>>>>                                                  Research Center
>>>>>                                                  1100 Fairview Ave. N.
>>>>>                                                  PO Box 19024 Seattle,
>>>>> WA
>>>>>                                                  98109
>>>>>
>>>>>                                                  Location: Arnold
>>>>>                                                  Building M1 B861
>>>>>                                                  Phone: (206) 667-2793
>>>>>
>>>>> <tel:%28206%29%20667-2793>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________**_________________
>>>>>                                              Bioc-devel at r-project.org
>>>>>
>>>>> <mailto:Bioc-devel at r-project.org>
>>>>>                                              mailing list
>>>>>
>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>
>>>>>              https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>>              <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                                          --
>>>>>                                          *A model is a lie that helps
>>>>> you
>>>>>                                          see the truth.*
>>>>>                                          *
>>>>>                                          *
>>>>>                                          Howard Skipper<
>>>>>
>>>>>
>>>>> http://cancerres.aacrjournals.__**org/content/31/9/1173.full.__pdf<
>>>>>
>>>>>
>>>>> http://cancerres.aacrjournals.__org/content/31/9/1173.full.pdf
>>>>>
>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>__>
>>>>>
>>>>>
>>>>>                                      **>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                                      --
>>>>>                                      *A model is a lie that helps you
>>>>> see
>>>>>                                      the truth.*
>>>>>                                      *
>>>>>                                      *
>>>>>                                      Howard Skipper<
>>>>>
>>>>> http://cancerres.aacrjournals.__**org/content/31/9/1173.full.__pdf<
>>>>>
>>>>>
>>>>> http://cancerres.aacrjournals.__org/content/31/9/1173.full.pdf
>>>>>
>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>__>
>>>>>
>>>>>
>>>>>                                      **>
>>>>>
>>>>>                                                 [[alternative HTML
>>>>>                                      version deleted]]
>>>>>
>>>>>
>>>>> ________________________________**_________________
>>>>>                                      Bioc-devel at r-project.org
>>>>>                                      <mailto:Bioc-devel at r-project.org>
>>>>>                                      mailing list
>>>>>
>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>
>>>>>              https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>>              <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>
>>>>>
>>>>>
>>>>>                                            [[alternative HTML version
>>>>>                                  deleted]]
>>>>>
>>>>>
>>>>> ________________________________**_________________
>>>>>                                  Bioc-devel at r-project.org
>>>>>                                  <mailto:Bioc-devel at r-project.org>
>>>>>                                  mailing list
>>>>>
>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>
>>>>>              https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>>              <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                              --
>>>>>                              Computational Biology / Fred Hutchinson
>>>>>                              Cancer Research Center
>>>>>                              1100 Fairview Ave. N.
>>>>>                              PO Box 19024 Seattle, WA 98109
>>>>>
>>>>>                              Location: Arnold Building M1 B861
>>>>>                              Phone: (206) 667-2793
>>>>> <tel:%28206%29%20667-2793>
>>>>>
>>>>>
>>>>> ________________________________**_________________
>>>>>                              Bioc-devel at r-project.org
>>>>>                              <mailto:Bioc-devel at r-project.org> mailing
>>>>> list
>>>>>
>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>
>>>>>              https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>>              <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                  --
>>>>>                  *A model is a lie that helps you see the truth.*
>>>>>                  *
>>>>>                  *
>>>>>                  Howard Skipper<
>>>>>
>>>>>
>>>>> http://cancerres.aacrjournals.__org/content/31/9/1173.full.pdf
>>>>>
>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>__>
>>>>>
>>>>>
>>>>>                            [[alternative HTML version deleted]]
>>>>>
>>>>>                  _________________________________________________
>>>>>                  Bioc-devel at r-project.org
>>>>>                  <mailto:Bioc-devel at r-project.org> mailing list
>>>>>                  https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>>                  <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>
>>>>>
>>>>>
>>>>>                   [[alternative HTML version deleted]]
>>>>>
>>>>>          _________________________________________________
>>>>>          Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>>>          mailing list
>>>>>          https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>>          <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>
>>>>>
>>>>>
>>>>>      --
>>>>>      Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>      1100 Fairview Ave. N.
>>>>>      PO Box 19024 Seattle, WA 98109
>>>>>
>>>>>      Location: Arnold Building M1 B861
>>>>>      Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> /A model is a lie that helps you see the truth./
>>>>> /
>>>>> /
>>>>> Howard Skipper
>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>>>>
>>>>
>>>>
>>>> --
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>
>>>
>>>
>>>
>>>
>>> --
>>> A model is a lie that helps you see the truth.
>>>
>>> Howard Skipper
>>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793



More information about the Bioc-devel mailing list