[Bioc-devel] SummarizedExperiments

Martin Morgan mtmorgan at fhcrc.org
Wed Sep 12 20:23:57 CEST 2012


On 09/06/2012 02:26 PM, Kasper Daniel Hansen wrote:
> Altogether, this looks great.  I am still implementing/testing, but
> it seems pretty nice.
>
> Here are a number of convenience methods I believe should be
> implemented.  Some of them come from eSet and some of them come from
> my own work in bsseq, but I have used all of them extensively in
> various packages.
>
> sampleNames, featureNames, pData
> granges, start, end, width, strand, seqnames, seqlengths, seqlevels
> findOverlaps, subsetByOverlaps

GenomicRanges 1.9.65 in Bioc devel makes the operations listed below my 
signature work on the rowData underlying a SummarizedExperiment. I 
particularly like subsetByOverlaps() for selecting, e.g., 
SummarizedExperiment rows that overlap features of interest, and 
seqlevels(x, force=TRUE) <- paste0("chr", 1:5), for instance, for 
selecting sequences of interest. See the section 'GRanges compatibility 
(rowData access)' on ?SummarizedExperiment.

Martin

compare, countOverlaps, coverage, disjointBins, distance, 
distanceToNearest, duplicated, end, end<-, findOverlaps, flank, follow, 
granges, isDisjoint, match, mcols, mcols<-, narrow, nearest, order, 
precede, ranges, ranges<-, rank, resize, restrict, seqinfo, seqinfo<-, 
seqnames, shift, sort, start, start<-, strand, strand<-, width, width<-.

>
> 'granges' is a method already defined in GenomicRanges, I have found
> it very convenient.  It is not clear to me why it is all lowercase.
>
> What I have done in bsseq for the latter two lines of methods is to
> have a class 'hasGRanges' that designates any object with a GRanges.
> Then implement the granges() method and the rest goes from there.  Not
> sure whether that makes sense in this context, but in general I have
> found it very convenient for my own work.  Code for all of this is in
> bsseq/R/hasGRanges.R.  I have two classes in the package inheriting
> from hasGRanges.  Note that the classes in bsseq have not been updated
> to build on the new SummarizedExperiment (yet).
>
> Kasper
>
>
> On Wed, Sep 5, 2012 at 1:31 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
>> It seems to work with a Matrix in the assays:
>>
>> R> packageVersion('GenomicRanges')
>> [1] ‘1.9.61’
>>
>> R> SE <- SummarizedExperiment(assays=wts, rowData=wts.locs)
>> Error in function (classes, fdef, mtable)  :
>>    unable to find an inherited method for function "SummarizedExperiment",
>> for signature "dgCMatrix"
>>
>> R> SE <- SummarizedExperiment(assays=SimpleList(weights=wts),
>> rowData=wts.locs)
>> dimnames(.) <- NULL:  translated to
>> dimnames(.) <- list(NULL,NULL)  <==>  unname(.)
>>
>> R> show(SE)
>> class: SummarizedExperiment
>> dim: 35855 35855
>> exptData(0):
>> assays(1): weights
>> rownames(35855): feat1 feat2 ... feat35854 feat35855
>> rowData metadata column names(10): egid signed ... DHS_DGF chromHMM_state
>> colnames(35855): feat1 feat2 ... feat35854 feat35855
>> colData names(0):
>>
>> R> class(assays(SE, 'weights'))
>> [1] "SimpleList"
>> attr(,"package")
>> [1] "IRanges"
>>
>> R> class(assays(SE)$weights)
>> [1] "dgCMatrix"
>> attr(,"package")
>> [1] "Matrix"
>>
>>
>> Very cool!  Albeit a tad disorienting.  I was going to patch this up and
>> then realized, hey, if someone's going to stuff a Matrix into their assays,
>> they ought to give it a name and know what's going on under the hood, at
>> least for now.  It seems to work correctly, too.
>>
>> Oh oh, I spoke too soon.  If I split by chromosome, the assay falls out:
>>
>> R> head(assays(SE)$weights)
>> 6 x 35855 sparse Matrix of class "dgCMatrix"
>>     [[ suppressing 82 column names ‘feat1’, ‘feat2’, ‘feat3’ ... ]]
>>
>> feat1 1 . .         .         .          .          . . . . . . . . . . . .
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> . . . . . . . . . . . .
>> feat2 . 1 .         .         .          .          . . . . . . . . . . . .
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> . . . . . . . . . . . .
>> feat3 . . 1.0000000 0.5737488 .          .          . . . . . . . . . . . .
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> . . . . . . . . . . . .
>> feat4 . . 0.5737488 1.0000000 .          .          . . . . . . . . . . . .
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> . . . . . . . . . . . .
>> feat5 . . .         .         1.00000000 0.04300373 . . . . . . . . . . . .
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> . . . . . . . . . . . .
>> feat6 . . .         .         0.04300373 1.00000000 . . . . . . . . . . . .
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> . . . . . . . . . . . .
>>
>> feat1 . . . . . . . . . . . . . . ......
>> feat2 . . . . . . . . . . . . . . ......
>> feat3 . . . . . . . . . . . . . . ......
>> feat4 . . . . . . . . . . . . . . ......
>> feat5 . . . . . . . . . . . . . . ......
>> feat6 . . . . . . . . . . . . . . ......
>>
>>   .....suppressing columns in show(); maybe adjust 'options(max.print= *)'
>>   ..............................
>> R> SE.by.chr <- split(SE, as.vector(seqnames(SE)))
>> R> head(assays(SE.by.chr)$weights)
>> Error in head(assays(SE.by.chr)$weights) :
>>    error in evaluating the argument 'x' in selecting a method for function
>> 'head': Error in function (classes, fdef, mtable)  :
>>    unable to find an inherited method for function "assays", for signature
>> "list"
>>
>> Ideas?  I would love to do something useful here instead of just fussing.
>> It's really a great structure.
>>
>> I love the use of a reference class to reference the internal pieces of an
>> object, it's such a clever hack.  Thanks much for adding it.
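For readers wondering how a hack of this kind can work, here is a minimal sketch in base R. It is not the GenomicRanges implementation: the class names AssayStore and ToyExperiment are hypothetical, and the only idea borrowed from the thread is that the assay data sit behind a reference while colData stays an ordinary value.

```r
library(methods)

# Hypothetical reference class holding the assay matrices.
AssayStore <- setRefClass("AssayStore", fields = list(data = "list"))

# Hypothetical S4 container: the assays slot holds a reference,
# colData is an ordinary copy-on-modify data.frame.
setClass("ToyExperiment",
         representation(assays = "AssayStore", colData = "data.frame"))

toy <- new("ToyExperiment",
           assays = AssayStore$new(data = list(rpm = matrix(0.2, 1000, 10))),
           colData = data.frame(sampleID = paste0("person", 1:10)))

# Adding a column touches only colData; both objects still point at
# the same AssayStore, so the matrix itself is not duplicated.
updated <- toy
updated@colData$treatment <- rep(c("a", "b"), 5)
shared <- identical(toy@assays, updated@assays)

# The flip side: a change made through one handle is visible through
# the other, so real code must copy the store before modifying assays.
store <- updated@assays
store$data$rpm[1, 1] <- 99
leaked <- toy@assays$data$rpm[1, 1]
```

In this sketch `shared` is TRUE and `leaked` picks up the value written through `store`, which is why a real class would need to copy the store before any assay modification.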
>>
>> --t
>>
>>
>>
>> On Tue, Sep 4, 2012 at 7:58 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>>
>>> On 09/04/2012 07:02 PM, Tim Triche, Jr. wrote:
>>>>
>>>> This is really cool.
>>>>
>>>> R> head(lsos(), 1)
>>>>                               Type     Size  Rows Columns
>>>> LAML.RNAseq SummarizedExperiment 38666528 23529     173
>>>> R> LAML.RNAseq.updated <- updateObject(LAML.RNAseq)
>>>> R> head(lsos(), 2)
>>>>                                       Type     Size  Rows Columns
>>>> LAML.RNAseq         SummarizedExperiment 38666528 23529     173
>>>> LAML.RNAseq.updated SummarizedExperiment  6101488 23529     173
>>>
>>>
>>> I think either the sizes are misleading (missing the content of the
>>> environment that the reference classes contain; probably
>>> LAML.RNAseq.updated@assays$data has size 38666528 - 6101488) or the data
>>> has not been updated correctly :\. Nonetheless manipulation of especially
>>> non-assay but also assay data should be faster.
>>>
>>> I added a brief section (1.9.60) to ?SummarizedExperiment describing what
>>> was done. Also, I tried (w/out actually testing...) to add support for
>>> Matrix instances.
>>>
>>> As a miscellaneous 'tip', when working with large data I find it useful to
>>> start R with --min-vsize=2048M --min-nsize=20M. These are documented in
>>> RShowDoc("R-intro") as 'for expert use only' and influence how much memory R
>>> allocates for vectors ('vsize'; vsize can be very large) and for
>>> S-expressions ('nsize' 50000000) and help get up to large memory allocations
>>> without innumerable garbage collections. Of course having big memory is of
>>> primary importance.
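The point about misleading sizes can be checked in base R alone. The sketch below assumes nothing about SummarizedExperiment; it only shows that object.size() counts the contents of a list but not the contents of an environment, which is what a reference class is built on. The variable names are made up for illustration.

```r
big <- matrix(0.2, nrow = 1000, ncol = 100)   # 100000 doubles

as_list <- list(data = big)                   # contents are counted
as_env <- new.env()
assign("data", big, envir = as_env)           # contents are not counted

size_list <- object.size(as_list)
size_env <- object.size(as_env)

print(size_list, units = "Kb")
print(size_env, units = "Kb")
```

The environment reports a size far below that of the matrix it holds, so a smaller reported size after updateObject() does not by itself show that less memory is in use.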
>>>
>>> Martin
>>>
>>>> R> str(LAML.RNAseq.updated@assays)
>>>> Reference class 'ShallowSimpleListAssays' [package "GenomicRanges"] with
>>>> 1 fields
>>>>    $ data:Formal class 'SimpleList' [package "IRanges"] with 4 slots
>>>>     .. ..@ listData       :List of 1
>>>>     .. .. ..$ rpm: num [1:23529, 1:173] 0 0 8.12 10.84 27.73 ...
>>>>     .. ..@ elementType    : chr "ANY"
>>>>     .. ..@ elementMetadata: NULL
>>>>     .. ..@ metadata       : list()
>>>>    and 11 methods,
>>>> R> packageVersion('GenomicRanges')
>>>> [1] ‘1.9.59’
>>>>
>>>> A slightly bigger one:
>>>>
>>>> R> head(lsos(),2)
>>>>                                Type       Size   Rows Columns
>>>> LAML         SummarizedExperiment 1950606472 485577     192
>>>> LAML.updated SummarizedExperiment  458912760 485577     192
>>>> R> system.time(LAML.updated$foo <- rep('bar', 192))
>>>>      user  system elapsed
>>>>     0.132   0.116   0.248
>>>> R> system.time(LAML$foo <- rep('bar', 192))
>>>>      user  system elapsed
>>>>     2.728   2.144   7.519
>>>>
>>>>
>>>> That is a really clever hack you added.  I think I will have to figure
>>>> out how it works :-)
>>>>
>>>> Thanks!!!
>>>>
>>>>
>>>> On Sun, Sep 2, 2012 at 2:47 PM, Martin Morgan <mtmorgan at fhcrc.org
>>>> <mailto:mtmorgan at fhcrc.org>> wrote:
>>>>
>>>>      On 08/30/2012 08:47 AM, Vincent Carey wrote:
>>>>
>>>>          I am all in favor of the dialogue and the aim.  I reproduced
>>>>          your first
>>>>          timing, and note
>>>>
>>>>              unix.time(ss2@colData$tx <- 1:1000)
>>>>
>>>>               user  system elapsed
>>>>              5.963   4.520  10.483
>>>>
>>>>              unix.time(colData(ss2)$tx2 <- 1:1000)
>>>>
>>>>               user  system elapsed
>>>>             17.937  13.074  31.016
>>>>
>>>>          BTW did you really mean to put the meth/unmeth in exptData, or
>>>>          should it be
>>>>          in assays?
>>>>
>>>>               From an experimentation perspective, I would want to redo
>>>> these
>>>>
>>>>          computations above with an environment holding the assay data
>>>>          and see how
>>>>          it changes.  I suppose one can infer from first principles that
>>>>
>>>>              ss4@assays[[1]]$meth[1:4,1:4]
>>>>
>>>>                 [,1] [,2] [,3] [,4]
>>>>          [1,]  0.2  0.2  0.2  0.2
>>>>          [2,]  0.2  0.2  0.2  0.2
>>>>          [3,]  0.2  0.2  0.2  0.2
>>>>          [4,]  0.2  0.2  0.2  0.2
>>>>
>>>>              ss4@assays[[2]]$unmeth[1:4,1:4]
>>>>
>>>>
>>>>                 [,1] [,2] [,3] [,4]
>>>>          [1,]  0.2  0.2  0.2  0.2
>>>>          [2,]  0.2  0.2  0.2  0.2
>>>>          [3,]  0.2  0.2  0.2  0.2
>>>>          [4,]  0.2  0.2  0.2  0.2
>>>>
>>>>              unix.time(colData(ss4)$tx2 <- 1:1000)
>>>>
>>>>               user  system elapsed
>>>>              0.010   0.002   0.012
>>>>
>>>>          where ss4 has a SimpleList of environments holding the assay
>>>> data.
>>>>
>>>>              ss4
>>>>
>>>>          class: SummarizedExperiment
>>>>          dim: 450000 1000
>>>>          exptData(0):
>>>>          assays(2): '' ''
>>>>          rownames: NULL
>>>>          rowData values names(0):
>>>>          colnames: NULL
>>>>          colData names(4): sampleID treatment tx tx2
>>>>
>>>>              ss4@assays[[1]]
>>>>
>>>>          <environment: 0x1f4e0ba0>
>>>>
>>>>          This can't be attempted without disabling validity checking, and
>>>>          the stack
>>>>          trace from the attempt is instructive.
>>>>
>>>>          I don't think it will be too difficult to establish a customized
>>>>          SummarizedExperiment that permits this, using the
>>>> lockedEnvironment
>>>>          approach of ExpressionSets.  I also do not know the down sides
>>>>          but it seems
>>>>          you/we could get some mileage on the approach and see.
>>>>
>>>>
>>>>      GenomicRanges v. 1.9.59 has some preliminary changes that address
>>>>      the performance issue. Serialized (saved) objects 'x' need to be
>>>>      updated with updateObject(x) (and be careful if these are your only
>>>>      instances of some hard work). There are some loose ends to tidy up.
>>>>
>>>>      Martin
>>>>
>>>>
>>>>
>>>>
>>>>          On Thu, Aug 30, 2012 at 10:53 AM, Kasper Daniel Hansen <
>>>>          kasperdanielhansen at gmail.com
>>>>          <mailto:kasperdanielhansen at gmail.com>> wrote:
>>>>
>>>>              My perspective (and the reason for my email) is pretty
>>>> simple.
>>>>
>>>>              First off, let me say that we will always be working on
>>>>              trading off
>>>>              memory and IO and development time and all the other stuff.
>>>>                I mean,
>>>>              clearly the environment stuff for ExpressionSets did not
>>>>              help all
>>>>              people, because we have xps, aroma.affymetrix and oligo all
>>>>              using some
>>>>              kind of file based backend.
>>>>
>>>>              I am not trying to solve something "big".  Putting
>>>>              methylation data
>>>>              inside an environment is not suddenly going to make my
>>>>              algorithms go
>>>>              _significantly_ faster or the memory usage to go
>>>>              _significantly_ down.
>>>>                 But it makes life faster when you explore data - I
>>>> believe.
>>>>
>>>>              Now for two realistic examples (one minfi, one bsseq) that I
>>>>              encounter
>>>>              all the time:
>>>>              library(GenomicRanges)
>>>>              nRows <- 450000
>>>>              nCols <- 1000
>>>>              SSet <- SummarizedExperiment(
>>>>                  rowData = GRanges(seqnames = rep("chr1", nRows),
>>>>                                    ranges = IRanges(1:nRows, width = 1)),
>>>>                  colData = DataFrame(sampleID = paste0("person", 1:nCols)),
>>>>                  exptData = SimpleList(
>>>>                      meth = matrix(0.2, ncol = nCols, nrow = nRows),
>>>>                      unmeth = matrix(0.2, ncol = nCols, nrow = nRows)))
>>>>              print(object.size(SSet), units = "auto")  # 6.7GB
>>>>              system.time({
>>>>                  colData(SSet)$treatment = rep(c("a", "b"), nCols / 2)
>>>>              }) # around 26 sec
>>>>
>>>>
>>>>              nRows <- 28000000
>>>>              nCols <- 10
>>>>              SSet <- SummarizedExperiment(
>>>>                  rowData = GRanges(seqnames = rep("chr1", nRows),
>>>>                                    ranges = IRanges(1:nRows, width = 1)),
>>>>                  colData = DataFrame(sampleID = paste0("person", 1:nCols)),
>>>>                  exptData = SimpleList(
>>>>                      meth = matrix(0.2, ncol = nCols, nrow = nRows),
>>>>                      unmeth = matrix(0.2, ncol = nCols, nrow = nRows)))
>>>>              print(object.size(SSet), units = "auto")  # 4.4GB
>>>>              system.time({
>>>>                  colData(SSet)$treatment = rep(c("a", "b"), nCols / 2)
>>>>              }) # around 20 sec
>>>>
>>>>              This is of course not that slow, especially not when
>>>>              compared to the
>>>>              massive amount of computation I had to do in order to arrive
>>>>              at the
>>>>              SSet in the first place.  But the difference is that I
>>>>              am staring at the
>>>>              screen when I work interactively, but I am out partying when
>>>>              I run my
>>>>              computations (overnight).  And I do stuff like the above
>>>>              surprisingly
>>>>              many times after I have generated the objects.
>>>>
>>>>              My perspective is that this was essentially "solved" with the
>>>>              ExpressionSet class, and I feel we are regressing a bit.  I
>>>> also
>>>>              believe that it should be "easy" to re-use some of the
>>>>              ideas/code from
>>>>              ExpressionSet and that it is worthwhile to do so, for
>>>>              something that
>>>>              may very well be an extremely important core class.
>>>>              SummarizedExperiment is such a nice class that I essentially
>>>>              have a
>>>>              parallel implementation in bsseq and Pete has one in genoset.
>>>>
>>>>              So essentially, I was interested in fixing a pretty small
>>>>              problem, but
>>>>              something that I am a bit irritated about.  And I feel we
>>>>              almost have
>>>>              a ready to use solution.
>>>>
>>>>              The reason I ask about it here, and not just do it myself,
>>>>              is twofold
>>>>              (1) it is tricky to get the environment right, and I do not
>>>> have
>>>>              enough experience on the environment part of ExpressionSet,
>>>>              that I
>>>>              feel comfortable doing it
>>>>              (2) I really think it would be beneficial to the community,
>>>> and
>>>>              experience tells us, that what happens in these core
>>>>              infrastructure
>>>>              packages are much more likely to be widely used.
>>>>
>>>>              While I think that it may be worthwhile for people to play
>>>> with
>>>>              reference classes, I think that going down that route is
>>>>              unexplored
>>>>              and may or may not be worth it.  There is something to be
>>>>              said for
>>>>              building on something we have experience with.  A reference
>>>>              class idea
>>>>              can be developed in parallel, if people are willing to
>>>>              commit the
>>>>              time.
>>>>
>>>>              Kasper
>>>>
>>>>
>>>>              On Thu, Aug 30, 2012 at 10:12 AM, Tim Triche, Jr.
>>>>              <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
>>>>
>>>>              wrote:
>>>>
>>>>                     I am interested in how best to benchmark such
>>>>                  operations (and another thought that I recently
>>>>                  had: "why can't I put a Doug Bates style 'Matrix'
>>>>                  in the SimpleList of assay data?").  Kasper is the
>>>>                  person who convinced me that ff, rhdf5, etc. should
>>>>                  be a last resort, since RAM is cheap.
>>>>
>>>>                  For large data, the TCGA BRCA samples are a hair under
>>>>                  800 (last I
>>>>
>>>>              looked)
>>>>
>>>>                  450k arrays with matching RNAseq and DNAseq, and that's
>>>>                  not even a big
>>>>                  study by breast cancer standards (IIRC, Gordon Mills was
>>>>                  rounding up 5000
>>>>                  tumors).  Both minfi and methylumi can work off of the raw
>>>>                  data in these
>>>>                  studies as a "practical" example, but when it really
>>>>                  gets interesting is
>>>>                  when 3-4 ragged assays exist for each patient.
>>>>
>>>>                  Exon-wise (or splice-graph-wise) RNAseq data with splice
>>>>                  junction read
>>>>                  counts is another place to find big SE-centric data with
>>>>                  ambiguities.
>>>>
>>>>              One
>>>>
>>>>                  of the reasons I keep coming back to the idea of
>>>>                  subsetting an SE by a
>>>>                  GRanges[List], e.g. SE.totalRNA[ UCSC.lincRNAs, ], is
>>>>                  its speed. In any
>>>>                  event, after my obligations for the day are dispatched,
>>>>                  I'll write up an
>>>>                  example of doing this off of TCGA data, perhaps from
>>>>                  colorectal (only
>>>>
>>>>              about
>>>>
>>>>                  400 subjects but at least it is published already).  You
>>>>                  are right, a
>>>>                  single use case is often worth 1000 specifications.
>>>>
>>>>                  --t
>>>>
>>>>
>>>>
>>>>                  On Thu, Aug 30, 2012 at 6:42 AM, Vincent Carey
>>>>                  <stvjc at channing.harvard.edu
>>>>                  <mailto:stvjc at channing.harvard.edu>> wrote:
>>>>
>>>>
>>>>
>>>>
>>>>                      On Thu, Aug 30, 2012 at 9:16 AM, Sean Davis
>>>>                      <sdavis2 at mail.nih.gov <mailto:sdavis2 at mail.nih.gov>>
>>>>
>>>>
>>>>              wrote:
>>>>
>>>>
>>>>
>>>>
>>>>                          On Thu, Aug 30, 2012 at 8:59 AM, Martin Morgan
>>>>                          <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>>>>
>>>>
>>>>                  wrote:
>>>>
>>>>
>>>>                              On 08/30/2012 04:42 AM, Vincent Carey wrote:
>>>>
>>>>                                  On Thu, Aug 30, 2012 at 6:27 AM, Tim
>>>>                                  Triche, Jr. <
>>>>
>>>>              tim.triche at gmail.com <mailto:tim.triche at gmail.com>
>>>>
>>>>
>>>>                                      wrote:
>>>>
>>>>
>>>>                                     nb.  one of the reasons for the
>>>>                                  existence of the MergedDataSet
>>>>
>>>>              class in
>>>>
>>>>                                      regulatoR (to be submitted for
>>>>                                      review shortly) is that, while SEs
>>>>
>>>>              are
>>>>
>>>>                                      absolutely fantastic for managing
>>>>                                      data matrices that are stapled to
>>>>
>>>>              a
>>>>
>>>>                                      GRanges, what is less awesome is
>>>>                                      having a relatively light-weight
>>>>                                      DataFrame
>>>>                                      for phenotypic data that requires
>>>>                                      the entire memory footprint be
>>>>                                      recreated
>>>>                                      upon writing a new column into said
>>>>                                      DataFrame.
>>>>
>>>>                                      If R5 classes didn't spook me a
>>>>                                      little, I would already have done
>>>>                                      something
>>>>
>>>>
>>>>                                  We don't often use this R5 terminology
>>>>                                  but I see Hadley has made an
>>>>                                  accessible document referring to
>>>>                                  reference classes in this way.
>>>>
>>>>
>>>>                              To me the challenge is more conceptual --
>>>>                              pass-by-reference and the
>>>>
>>>>              way
>>>>
>>>>                              that two variables pointing to the instance
>>>>                              are updated at the same
>>>>
>>>>              time --
>>>>
>>>>                              and I had been thinking of a
>>>>                              LockedEnvironment-style implementation
>>>>
>>>>              where
>>>>
>>>>                              some operations were free ('copying') but
>>>>                              others weren't (subset,
>>>>
>>>>              subset
>>>>
>>>>                              assign). But maybe there are some more
>>>>                              direct approaches...
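The trade-off described here (copying is free, subset-assign is not) can be sketched with a locked environment in base R. This is only an illustration under assumed names: copy_store is a hypothetical helper, not a Biobase or GenomicRanges function, and a real class would wrap these steps in its accessor methods.

```r
# Assay data kept in an environment and locked against modification.
store <- new.env()
assign("meth", matrix(0.2, 4, 4), envir = store)
lockEnvironment(store, bindings = TRUE)

# 'Copying' is free: both names refer to the same environment.
alias <- store
same_object <- identical(alias, store)

# In-place modification is refused while the bindings are locked.
attempt <- tryCatch({
    assign("meth", matrix(0, 4, 4), envir = store)
    "modified"
}, error = function(e) "refused")

# Subset-assign therefore pays for a real copy: build a fresh, unlocked
# environment from the old contents, then change the copy.
# (copy_store is a hypothetical helper, not a Biobase function.)
copy_store <- function(env) {
    list2env(as.list(env, all.names = TRUE), envir = new.env())
}
modified <- copy_store(store)
modified$meth[1, 1] <- 0
```

Because the original environment is locked, two variables that share it cannot surprise each other; the cost of a full copy is paid only when the data are actually changed.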
>>>>
>>>>
>>>>                                 My 2c: This is a situation where some
>>>>                              experimental data would be
>>>>
>>>>                                  helpful.
>>>>
>>>>
>>>>
>>>>                          Perhaps Tim or Kasper can share a largish
>>>>                          (unpublished) dataset and a
>>>>                          typical workflow with the Seattle folks.  Even
>>>>                          that level of detail
>>>>
>>>>              would
>>>>
>>>>                          give some sense of the scope of the problem.
>>>>
>>>>
>>>>
>>>>                      For clarification, I did not mean large data,
>>>>                      although that would be
>>>>                      welcome.  I meant data on computational experiments
>>>>                      with different
>>>>                      approaches -- that is, performance statistics that
>>>>                      would indicate where
>>>>                      scalability fails, what steps have been taken to
>>>>                      recover it, and how
>>>>
>>>>              costly
>>>>
>>>>                      those steps are in terms of complexity, reliability,
>>>>                      maintainability,
>>>>
>>>>              risk
>>>>
>>>>                      at the user/application end.
>>>>
>>>>
>>>>
>>>>                          Sean
>>>>
>>>>
>>>>                              Yes, for instance where in the interactive
>>>>                              use is time being spent? Is
>>>>                              it copying the assays, or validity, or
>>>>                              actually updating the row
>>>>
>>>>              data? Is
>>>>
>>>>                              500000 x 800 an appropriate scale to be
>>>>                              thinking about?
>>>>
>>>>
>>>>                                   The main avenues for a developer seem
>>>>                              to be a) use environments or
>>>>
>>>>                                  reference classes; there are some costs
>>>>                                  and we should understand
>>>>
>>>>              them,
>>>>
>>>>                                  and
>>>>                                  b) use an out-of-memory approach like
>>>>                                  rhdf5 or ff.  Again there will
>>>>
>>>>              be
>>>>
>>>>                                  some costs.  It should be relatively
>>>>                                  easy to experiment with these.
>>>>
>>>>                 One
>>>>
>>>>                                  thing I just learned about is
>>>>                                  setValidity2 and disableValidity
>>>>
>>>>              (defined
>>>>
>>>>                                  in
>>>>                                  IRanges IIRC) ... these allow you to
>>>>                                  construct certain variations on
>>>>                                  SummarizedExperiment with less attention
>>>>                                  to deeper infrastructure.
>>>>
>>>>
>>>>                              probably I can make better use of the
>>>>                              insights the IRanges guys have
>>>>
>>>>              had
>>>>
>>>>                              in their careful development and application
>>>>                              of validity methods,
>>>>
>>>>              though I
>>>>
>>>>                              feel a bit like these are 'attractive
>>>>                              hazards' that tempt us to do
>>>>
>>>>              unsafe
>>>>
>>>>                              things and then pay the price later. This is
>>>>                              likely the first
>>>>
>>>>              direction
>>>>
>>>>                              I'll explore.
>>>>
>>>>                              Exploring a little I already see that there
>>>>                              are some pretty dumb
>>>>
>>>>              things
>>>>
>>>>                              being done in assignment.
>>>>
>>>>                              Martin
>>>>
>>>>
>>>>                                 whereby the assays/libraries for a given
>>>>                              study subject are all
>>>>
>>>>              pointed
>>>>
>>>>                                      to
>>>>                                      as SEs (i.e. RNAseq, BSseq,
>>>>                                      expression/methylation arrays,
>>>> CNV/SNP
>>>>                                      arrays,
>>>>                                      WGS or exomic DNAseq) and the column
>>>>                                      (phenotype) data can avoid
>>>>
>>>>              being
>>>>
>>>>                                      subject to these constraints.  Truth
>>>>                                      be told I *still* want to do
>>>>
>>>>              that
>>>>
>>>>                                      because, most of the time, updates
>>>>                                      to the latter are independent of,
>>>>                                      and
>>>>                                      happen subsequently to, loading the
>>>>                                      former.
>>>>
>>>>                                      Suggestions would be welcome,
>>>>                                      because other than these minor
>>>>
>>>>              niggles,
>>>>
>>>>                                      the
>>>>                                      SummarizedExperiment class is almost
>>>>                                      perfect for many tasks.
>>>>
>>>>
>>>>
>>>>
>>>>                                      On Wed, Aug 29, 2012 at 9:57 PM, Tim
>>>>                                      Triche, Jr. <
>>>>
>>>>              tim.triche at gmail.com <mailto:tim.triche at gmail.com>
>>>>
>>>>
>>>>
>>>>                                          wrote:
>>>>
>>>>
>>>>                                         assigning new colData columns, or
>>>>                                      overwriting old ones, in a
>>>>
>>>>              sizable
>>>>
>>>>                                          (say
>>>>                                          500000 row x 800 column) SE is
>>>>                                          nauseatingly slow.
>>>>
>>>>
>>>>
>>>>
>>>>                                          There has to be a better way --
>>>>                                          I'm willing to write it if
>>>> someone
>>>>
>>>>              can
>>>>
>>>>                                          point out an obvious way to do it
>>>>
>>>>
>>>>
>>>>                                          On Wed, Aug 29, 2012 at 9:52 PM,
>>>> Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
>>>>
>>>>   On Thu, Aug 30, 2012 at 12:44 AM, Martin Morgan
>>>>   <mtmorgan at fhcrc.org> wrote:
>>>>
>>>>     On 08/29/2012 06:46 PM, Kasper Daniel Hansen wrote:
>>>>
>>>>       There is a lot of good stuff to say about
>>>>       SummarizedExperiments, and from a certain point of view I
>>>>       have a parallel implementation in bsseq (and there is also
>>>>       one in genoset).
>>>>
>>>>       However, I really like having the assayData inside an
>>>>       environment.  This helps some on memory and - equally
>>>>       important - speed at the command line.  I certainly need to
>>>>       very heavily consider using an environment in bsseq.
>>>>
>>>>       After some discussion with Tim (Triche) we have agreed that
>>>>       something like SummarizedExperiments is the way to go at
>>>>       least for the methylation arrays.  We need to be able to
>>>>       easily handle 1000s of samples.
>>>>
>>>>       What is the chance that we can get the option of having the
>>>>       assayData inside an environment, perhaps by
>>>>
>>>>         * Making a class that is an environment and inherits from
>>>>           SimpleList.
>>>>         * Using a classUnion between the existing class of the
>>>>           assayData and an environment.
>>>>         * Third option that is probably better than the preceding
>>>>           two, but which I cannot come up with right now.
>>>>
>>>>     Probably something can / will be done. I guess the slowness
>>>>     you're talking about is when rowData / colData columns are
>>>>     manipulated; any kind of subsetting would mean a 'deep' copy.
>>>>     Martin
>>>>
>>>>   Yes, for example manipulating colData - something that
>>>>   conceptually should be quick and easy.  Of course, this will not
>>>>   affect any real computation on the assayData matrices, but it
>>>>   will make life at the command prompt more pleasant.
>>>>
>>>>   Kasper
>>>>
>>>>       This would - in my opinion - be very nice and worthwhile.
>>>>
>>>>       Kasper
>>>>
>>>>       _______________________________________________
>>>>       Bioc-devel at r-project.org mailing list
>>>>       https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>> --
>>>> /A model is a lie that helps you see the truth./
>>>> Howard Skipper
>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>>>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


