[Bioc-devel] SummarizedExperiments
Martin Morgan
mtmorgan at fhcrc.org
Wed Sep 12 21:33:54 CEST 2012
On 09/12/2012 12:15 PM, Kasper Daniel Hansen wrote:
> On Wed, Sep 12, 2012 at 2:23 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> On 09/06/2012 02:26 PM, Kasper Daniel Hansen wrote:
>>>
>>> Alltogether, this looks great. I am still implementing/testing, but
>>> seems pretty nice.
>>>
>>> Here is a number of convenience methods I believe should be
>>> implemented. Some of them comes from eSet and some of them comes from
>>> my own work in bsseq, but I have used all of them extensively in
>>> various packages.
>>>
>>> sampleNames, featureNames, pData
>>> granges, start, end, width, strand, seqnames, seqlengths, seqlevels
>>> findOverlaps, subsetByOverlaps
>>
>>
>> GenomicRanges 1.9.65 in Bioc devel makes the following operations on
>> SummarizedExperiment work on the underlying rowData; I particularly like
>> subsetByOverslaps() for selecting, e.g., SummarizedExperiment rows that
>> overlap features of interest, and seqlevels(x, force=TRUE) = paste("chr",
>> 1:5), for instance, for selecting sequences of interest. See the section
>> 'GRanges compatibility (rowData access)' on ?SummarizedExperiment.
>
> Nice list, this is useful. Now I guess I need to update bsseq.
Should have mentioned (even if not directly relevant to bsseq) that
classes that extend SummarizedExperiment, like VCF, get these automatically.
> seqselect is nice, but also destructive. In my own work I have found
> it quite useful to be able to subset on a whole chromosome, without
> destroying the old object, like
> function(sset, "chr1")
sset[seqnames(sset) == "chr1",]
> I know this can be done by a long subsetByOverlaps, but - like
> seqselect - it is nice to have a quicker way of doing it.
>
> I don't see any mention of
> sset[gr,]
> Is this planned or was it decided to drop this?
In the end I still don't like the semantic fit between [ and the overlap
implied by sset[gr,]
> One thing I have in my package that I find indispensable is combine
> and (my own) combineList. The later for combining > 2 objects, which
> has a lot of possibilities for speed up especially if (very common)
> all the objects have the same rowData, as opposed to Reduce(combine,
> LIST).. Usecase: you need to add additional samples to your
> SummarizedExperiment.
I found it difficult in Biobase to write combine methods for eSet, where
you're really requiring a lot from the user (about the phenoData /
featureData structured in the same way) or going through contortions to
make it the same in a reasonable-but-ad-hoc way (e.g., when two columns
are factors with the same set of levels but encoded differently). Maybe
the effort required is proportional to the utility of the function
provided... I'll give it some more thought.
Martin
> My code (which may utilize that width(object) == 1 is in bsseq/R/BSseq-class.R
>
> Kasper
>
>
>
>>
>> Martin
>>
>> compare, countOverlaps, coverage, disjointBins, distance, distanceToNearest,
>> duplicated, end, end<-, findOverlaps, flank, follow, granges, isDisjoint,
>> match, mcols, mcols<-, narrow, nearest, order, precede, ranges, ranges<-,
>> rank, resize, restrict, seqinfo, seqinfo<-, seqnames, shift, sort, start,
>> start<-, strand, strand<-, width, width<-.
>>
>>
>>>
>>> 'granges' is a method already defined in GenomicRanges, I have found
>>> it very convenient. It is not clear to me why it is all lowercase
>>>
>>> What I have done in bsseq for the later two lines of methods, is to
>>> have a class 'hasGRanges' that designate any object with a GRanges.
>>> Then implement the granges() method and the rest goes from there. Not
>>> sure whether that makes sense in this context, but in general I have
>>> found it very convenient for my own work. Code for all of this is in
>>> bsseq/R/hasGRanges.R. I have two classes in the package inheriting
>>> from hasGRanges. Note that the classes in bsseq have not been updated
>>> to build on the new SummarizedExperiment (yet).
>>>
>>> Kasper
>>>
>>>
>>> On Wed, Sep 5, 2012 at 1:31 PM, Tim Triche, Jr. <tim.triche at gmail.com>
>>> wrote:
>>>>
>>>> It seems to work with a Matrix in the assays:
>>>>
>>>> R> packageVersion('GenomicRanges')
>>>> [1] ‘1.9.61’
>>>>
>>>> R> SE <- SummarizedExperiment(assays=wts, rowData=wts.locs)
>>>> Error in function (classes, fdef, mtable) :
>>>> unable to find an inherited method for function
>>>> "SummarizedExperiment",
>>>> for signature "dgCMatrix"
>>>>
>>>> R> SE <- SummarizedExperiment(assays=SimpleList(weights=wts),
>>>> rowData=wts.locs)
>>>> dimnames(.) <- NULL: translated to
>>>> dimnames(.) <- list(NULL,NULL) <==> unname(.)
>>>>
>>>> R> show(SE)
>>>> class: SummarizedExperiment
>>>> dim: 35855 35855
>>>> exptData(0):
>>>> assays(1): weights
>>>> rownames(35855): feat1 feat2 ... feat35854 feat35855
>>>> rowData metadata column names(10): egid signed ... DHS_DGF chromHMM_state
>>>> colnames(35855): feat1 feat2 ... feat35854 feat35855
>>>> colData names(0):
>>>>
>>>> R> class(assays(SE, 'weights'))
>>>> [1] "SimpleList"
>>>> attr(,"package")
>>>> [1] "IRanges"
>>>>
>>>> R> class(assays(SE)$weights)
>>>> [1] "dgCMatrix"
>>>> attr(,"package")
>>>> [1] "Matrix"
>>>>
>>>>
>>>> Very cool! Albeit a tad disorienting. I was going to patch this up and
>>>> then realized, hey, if someone's going to stuff a Matrix into their
>>>> assays,
>>>> they ought to give it a name and know what's going on under the hood, at
>>>> least for now. It seems to work correctly, too.
>>>>
>>>> Oh oh, I spoke too soon. If I split by chromosome, the assay falls out:
>>>>
>>>> R> head(assays(SE)$weights)
>>>> 6 x 35855 sparse Matrix of class "dgCMatrix"
>>>> [[ suppressing 82 column names ‘feat1’, ‘feat2’, ‘feat3’ ... ]]
>>>>
>>>> feat1 1 . . . . . . . . . . . . . . . .
>>>> .
>>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>>> .
>>>> . . . . . . . . . . . .
>>>> feat2 . 1 . . . . . . . . . . . . . . .
>>>> .
>>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>>> .
>>>> . . . . . . . . . . . .
>>>> feat3 . . 1.0000000 0.5737488 . . . . . . . . . . . . .
>>>> .
>>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>>> .
>>>> . . . . . . . . . . . .
>>>> feat4 . . 0.5737488 1.0000000 . . . . . . . . . . . . .
>>>> .
>>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>>> .
>>>> . . . . . . . . . . . .
>>>> feat5 . . . . 1.00000000 0.04300373 . . . . . . . . . . .
>>>> .
>>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>>> .
>>>> . . . . . . . . . . . .
>>>> feat6 . . . . 0.04300373 1.00000000 . . . . . . . . . . .
>>>> .
>>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>>> .
>>>> . . . . . . . . . . . .
>>>>
>>>> feat1 . . . . . . . . . . . . . . ......
>>>> feat2 . . . . . . . . . . . . . . ......
>>>> feat3 . . . . . . . . . . . . . . ......
>>>> feat4 . . . . . . . . . . . . . . ......
>>>> feat5 . . . . . . . . . . . . . . ......
>>>> feat6 . . . . . . . . . . . . . . ......
>>>>
>>>> .....suppressing columns in show(); maybe adjust 'options(max.print=
>>>> *)'
>>>> ..............................
>>>> R> SE.by.chr <- split(SE, as.vector(seqnames(SE)))
>>>> R> head(assays(SE.by.chr)$weights)
>>>> Error in head(assays(SE.by.chr)$weights) :
>>>> error in evaluating the argument 'x' in selecting a method for
>>>> function
>>>> 'head': Error in function (classes, fdef, mtable) :
>>>> unable to find an inherited method for function "assays", for
>>>> signature
>>>> "list"
>>>>
>>>> Ideas? I would love to do something useful here instead of just fussing.
>>>> It's really a great structure.
>>>>
>>>> I love the use of a reference class to reference the internal pieces of
>>>> an
>>>> object, it's such a clever hack. Thanks much for adding it.
>>>>
>>>> --t
>>>>
>>>>
>>>>
>>>> On Tue, Sep 4, 2012 at 7:58 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>>>>
>>>>>
>>>>> On 09/04/2012 07:02 PM, Tim Triche, Jr. wrote:
>>>>>>
>>>>>>
>>>>>> This is really cool.
>>>>>>
>>>>>> R> head(lsos(), 1)
>>>>>> Type Size Rows Columns
>>>>>> LAML.RNAseq SummarizedExperiment 38666528 23529 173
>>>>>> R> LAML.RNAseq.updated <- updateObject(LAML.RNAseq)
>>>>>> R> head(lsos(), 2)
>>>>>> Type Size Rows Columns
>>>>>> LAML.RNAseq SummarizedExperiment 38666528 23529 173
>>>>>> LAML.RNAseq.updated SummarizedExperiment 6101488 23529 173
>>>>>
>>>>>
>>>>>
>>>>> I think either the sizes are misleading (missing the content of the
>>>>> environment that the reference classes contain, probably
>>>>> LAML.RNAseq.updated at assays$data has size 38666528 - 6101488) or the data
>>>>> has
>>>>> not been updated correctly :\. Nonetheless manipulation of especially
>>>>> non-assay but also assay data should be faster.
>>>>>
>>>>> I added a brief section (1.9.60) to ?SummarizedExperiment describing
>>>>> what
>>>>> was done. Also, I tried (w/out actually testing...) to add support for
>>>>> Matrix instances.
>>>>>
>>>>> As a miscellaneous 'tip', when working with large data I find it useful
>>>>> to
>>>>> start R with --min-vsize=2048M --min-nsize=20M. These are documented in
>>>>> RShowDoc("R-intro") as 'for expert use only' and influence how much
>>>>> memory R
>>>>> allocates for vectors ('vsize'; vsize can be very large) and for
>>>>> S-expressions ('nsize' 50000000) and help get up to large memory
>>>>> allocations
>>>>> without innumerable garbage collections. Of course having big memory is
>>>>> of
>>>>> primary importance.
>>>>>
>>>>> Martin
>>>>>
>>>>>> R> str(LAML.RNAseq.updated at assays)
>>>>>> Reference class 'ShallowSimpleListAssays' [package "GenomicRanges"]
>>>>>> with
>>>>>> 1 fields
>>>>>> $ data:Formal class 'SimpleList' [package "IRanges"] with 4 slots
>>>>>> .. ..@ listData :List of 1
>>>>>> .. .. ..$ rpm: num [1:23529, 1:173] 0 0 8.12 10.84 27.73 ...
>>>>>> .. ..@ elementType : chr "ANY"
>>>>>> .. ..@ elementMetadata: NULL
>>>>>> .. ..@ metadata : list()
>>>>>> and 11 methods,R>
>>>>>> R> packageVersion('GenomicRanges')
>>>>>> [1] ‘1.9.59’
>>>>>>
>>>>>> A slightly bigger one:
>>>>>>
>>>>>> R> head(lsos(),2)
>>>>>> Type Size Rows Columns
>>>>>> LAML SummarizedExperiment 1950606472 485577 192
>>>>>> LAML.updated SummarizedExperiment 458912760 485577 192
>>>>>> R> system.time(LAML.updated$foo <- rep('bar', 192))
>>>>>> user system elapsed
>>>>>> 0.132 0.116 0.248
>>>>>> R> system.time(LAML$foo <- rep('bar', 192))
>>>>>> user system elapsed
>>>>>> 2.728 2.144 7.519
>>>>>>
>>>>>>
>>>>>> That is a really clever hack you added. I think I will have to figure
>>>>>> out how it works :-)
>>>>>>
>>>>>> Thanks!!!
>>>>>>
>>>>>>
>>>>>> On Sun, Sep 2, 2012 at 2:47 PM, Martin Morgan <mtmorgan at fhcrc.org
>>>>>> <mailto:mtmorgan at fhcrc.org>> wrote:
>>>>>>
>>>>>> On 08/30/2012 08:47 AM, Vincent Carey wrote:
>>>>>>
>>>>>> I am all in favor of the dialogue and the aim. I reproduced
>>>>>> your first
>>>>>> timing, and note
>>>>>>
>>>>>> unix.time(ss2 at colData$tx <- 1:1000)
>>>>>>
>>>>>> user system elapsed
>>>>>> 5.963 4.520 10.483
>>>>>>
>>>>>> unix.time(colData(ss2)$tx2 <- 1:1000)
>>>>>>
>>>>>> user system elapsed
>>>>>> 17.937 13.074 31.016
>>>>>>
>>>>>> BTW did you really mean to put the meth/unmeth in exptData, or
>>>>>> should it be
>>>>>> in assays?
>>>>>>
>>>>>> From an experimentation perspective, I would want to redo
>>>>>> these
>>>>>>
>>>>>> computations above with an environment holding the assay data
>>>>>> and see how
>>>>>> it changes. I suppose one can infer from first principles
>>>>>> that
>>>>>>
>>>>>> ss4 at assays[[1]]$meth[1:4,1:4]
>>>>>>
>>>>>> [,1] [,2] [,3] [,4]
>>>>>> [1,] 0.2 0.2 0.2 0.2
>>>>>> [2,] 0.2 0.2 0.2 0.2
>>>>>> [3,] 0.2 0.2 0.2 0.2
>>>>>> [4,] 0.2 0.2 0.2 0.2
>>>>>>
>>>>>> ss4 at assays[[2]]$unmeth[1:4,1:__4]
>>>>>>
>>>>>>
>>>>>> [,1] [,2] [,3] [,4]
>>>>>> [1,] 0.2 0.2 0.2 0.2
>>>>>> [2,] 0.2 0.2 0.2 0.2
>>>>>> [3,] 0.2 0.2 0.2 0.2
>>>>>> [4,] 0.2 0.2 0.2 0.2
>>>>>>
>>>>>> unix.time(colData(ss4)$tx2 <- 1:1000)
>>>>>>
>>>>>> user system elapsed
>>>>>> 0.010 0.002 0.012
>>>>>>
>>>>>> where ss4 has a SimpleList of environments holding the assay
>>>>>> data.
>>>>>>
>>>>>> ss4
>>>>>>
>>>>>> class: SummarizedExperiment
>>>>>> dim: 450000 1000
>>>>>> exptData(0):
>>>>>> assays(2): '' ''
>>>>>> rownames: NULL
>>>>>> rowData values names(0):
>>>>>> colnames: NULL
>>>>>> colData names(4): sampleID treatment tx tx2
>>>>>>
>>>>>> ss4 at assays[[1]]
>>>>>>
>>>>>> <environment: 0x1f4e0ba0>
>>>>>>
>>>>>> This can't be attempted without disabling validity checking,
>>>>>> and
>>>>>> the stack
>>>>>> trace from the attempt is instructive.
>>>>>>
>>>>>> I don't think it will be too difficult to establish a
>>>>>> customized
>>>>>> SummarizedExperiment that permits this, using the
>>>>>> lockedEnvironment
>>>>>> approach of ExpressionSets. I also do not know the down sides
>>>>>> but it seems
>>>>>> you/we could get some mileage on the approach and see.
>>>>>>
>>>>>>
>>>>>> GenomicRanges v. 1.9.59 has some preliminary changes that address
>>>>>> the performance issue. Serialized (saved) objects 'x' need to be
>>>>>> updated with updateObject(x) (and be careful if these are your
>>>>>> only
>>>>>> instances of some hard work). There are some loose ends to tidy
>>>>>> up.
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 10:53 AM, Kasper Daniel Hansen <
>>>>>> kasperdanielhansen at gmail.com
>>>>>> <mailto:kasperdanielhansen at gmail.com>> wrote:
>>>>>>
>>>>>> My perspective (and the reason for me email) is pretty
>>>>>> simple.
>>>>>>
>>>>>> First off, let me say that we will always be working on
>>>>>> trading off
>>>>>> memory and IO and development time and all the other
>>>>>> stuff.
>>>>>> I mean,
>>>>>> clearly the environment stuff for ExpressionSets did not
>>>>>> help all
>>>>>> people, because we have xps, aroma.affymetrix and oligo
>>>>>> all
>>>>>> using some
>>>>>> kind of file based backend.
>>>>>>
>>>>>> I am not trying to solve something "big". Putting
>>>>>> methylation data
>>>>>> inside an environment is not suddenly going to make my
>>>>>> algorithms go
>>>>>> _significantly_ faster or the memory usage to go
>>>>>> _significantly_ down.
>>>>>> But it makes life faster when you explore data - I
>>>>>> believe.
>>>>>>
>>>>>> Now for two realistic examples (one minfi, one bsseq) that
>>>>>> I
>>>>>> encounter
>>>>>> all the time:
>>>>>> library(GenomicRanges)
>>>>>> nRows <- 450000
>>>>>> nCols <- 1000
>>>>>> SSet <- SummarizedExperiment(rowData = GRanges(seqnames =
>>>>>> rep("chr1",
>>>>>> nRows),
>>>>>> ranges = IRanges(1:nRows,
>>>>>> width = 1)),
>>>>>> colData =
>>>>>> DataFrame(sampleID =
>>>>>> paste0("person", 1:nCols)),
>>>>>> exptData = SimpleList(meth
>>>>>> =
>>>>>> matrix(0.2,
>>>>>> ncol = nCols, nrow = nRows),
>>>>>> unmeth = matrix(0.2, ncol =
>>>>>> nCols, nrow =
>>>>>> nRows)))
>>>>>> print(object.size(SSet), units = "auto") # 6.7GB
>>>>>> system.time({
>>>>>> colData(SSet)$treatment = rep(c("a", "b"), nCols /
>>>>>> 2)
>>>>>> }) # around 26 sec
>>>>>>
>>>>>>
>>>>>> nRows <- 28000000
>>>>>> nCols <- 10
>>>>>> SSet <- SummarizedExperiment(rowData = GRanges(seqnames =
>>>>>> rep("chr1",
>>>>>> nRows),
>>>>>> ranges = IRanges(1:nRows,
>>>>>> width = 1)),
>>>>>> colData =
>>>>>> DataFrame(sampleID =
>>>>>> paste0("person", 1:nCols)),
>>>>>> exptData = SimpleList(meth
>>>>>> =
>>>>>> matrix(0.2,
>>>>>> ncol = nCols, nrow = nRows),
>>>>>> unmeth = matrix(0.2, ncol =
>>>>>> nCols, nrow =
>>>>>> nRows)))
>>>>>> print(object.size(SSet), units = "auto") # 4.4GB
>>>>>> system.time({
>>>>>> colData(SSet)$treatment = rep(c("a", "b"), nCols /
>>>>>> 2)
>>>>>> }) # around 20 sec
>>>>>>
>>>>>> This is of course not that slow, especially not when
>>>>>> compared to the
>>>>>> massive amount of computation I had to do in order to
>>>>>> arrive
>>>>>> at the
>>>>>> SSet in the first place. But the difference is that I
>>>>>> staring at the
>>>>>> screen when I work interactively, but I am out partying
>>>>>> when
>>>>>> I run my
>>>>>> computations (overnight). And I do stuff like the above
>>>>>> surprisingly
>>>>>> many times after I have generated the objects.
>>>>>>
>>>>>> My perspective is that this was essentially "solved" with
>>>>>> the
>>>>>> ExpressionSet class, and I feel we are regressing a bit.
>>>>>> I
>>>>>> also
>>>>>> believe that it should be "easy" to re-use some of the
>>>>>> ideas/code from
>>>>>> ExpressionSet and that it is worthwhile to do so, for
>>>>>> something that
>>>>>> may very well be an extremely important core class.
>>>>>> SummarizedExperiment is such a nice class that I
>>>>>> essentially
>>>>>> have a
>>>>>> parallel implementation in bsseq and Pete has one in
>>>>>> genoset.
>>>>>>
>>>>>> So essentially, I was interested in fixing a pretty small
>>>>>> problem, but
>>>>>> something that I am a bit irritated about. And I feel we
>>>>>> almost have
>>>>>> a ready to use solution.
>>>>>>
>>>>>> The reason I ask about it here, and not just do it myself,
>>>>>> is twofold
>>>>>> (1) it is tricky to get the environment right, and I do
>>>>>> not
>>>>>> have
>>>>>> enough experience on the environment part of
>>>>>> ExpressionSet,
>>>>>> that I
>>>>>> feel comfortable doing it
>>>>>> (2) I really think it would be beneficial to the
>>>>>> community,
>>>>>> and
>>>>>> experience tells us, that what happens in these core
>>>>>> infrastructure
>>>>>> packages are much more likely to be widely used.
>>>>>>
>>>>>> While I think that it may be worthwhile for people to play
>>>>>> with
>>>>>> reference classes, I think that going down that route is
>>>>>> unexplored
>>>>>> and may or may not be worth it. There is something to be
>>>>>> said for
>>>>>> building on something we have experience with. A
>>>>>> reference
>>>>>> class idea
>>>>>> can be developed in parallel, if people are willing to
>>>>>> commit the
>>>>>> time.
>>>>>>
>>>>>> Kasper
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 10:12 AM, Tim Triche, Jr.
>>>>>> <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>> I am interested in how best to benchmark such
>>>>>> operations (and another
>>>>>> thought that I recently had, "why can't I put a Doug
>>>>>> Bates style 'Matrix'
>>>>>> in the SimpleList of assay data)? Kasper is the
>>>>>> person
>>>>>> who convinced me
>>>>>> that ff, rhdf5, etc.should be a last resort, since RAM
>>>>>> is cheap.
>>>>>>
>>>>>> For large data, the TCGA BRCA samples are a hair under
>>>>>> 800 (last I
>>>>>>
>>>>>> looked)
>>>>>>
>>>>>> 450k arrays with matching RNAseq and DNAseq, and
>>>>>> that's
>>>>>> not even a big
>>>>>> study by breast cancer standards (IIRC, Gordon Mills
>>>>>> was
>>>>>> rounding up 5000
>>>>>> tumors) Both minfi and methylumi can work off of the
>>>>>> raw
>>>>>> data in these
>>>>>> studies as a "practical" example, but when it really
>>>>>> gets interesting is
>>>>>> when 3-4 ragged assays exist for each patient.
>>>>>>
>>>>>> Exon-wise (or splice-graph-wise) RNAseq data with
>>>>>> splice
>>>>>> junction read
>>>>>> counts is another place to find big SE-centric data
>>>>>> with
>>>>>> ambiguities.
>>>>>>
>>>>>> One
>>>>>>
>>>>>> of the reasons I keep coming back to the idea of
>>>>>> subsetting an SE by a
>>>>>> GRanges[List], e.g. SE.totalRNA[ UCSC.lincRNAs, ], is
>>>>>> its speed. In any
>>>>>> event, after my obligations for the day are
>>>>>> dispatched,
>>>>>> I'll write up an
>>>>>> example of doing this off of TCGA data, perhaps from
>>>>>> colorectal (only
>>>>>>
>>>>>> about
>>>>>>
>>>>>> 400 subjects but at least it is published already).
>>>>>> You
>>>>>> are right, a
>>>>>> single use case is often worth 1000 specifications.
>>>>>>
>>>>>> --t
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 6:42 AM, Vincent Carey
>>>>>> <stvjc at channing.harvard.edu
>>>>>> <mailto:stvjc at channing.harvard.edu>>__wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 9:16 AM, Sean Davis
>>>>>> <sdavis2 at mail.nih.gov
>>>>>> <mailto:sdavis2 at mail.nih.gov>>
>>>>>>
>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 8:59 AM, Martin Morgan
>>>>>> <mtmorgan at fhcrc.org
>>>>>> <mailto:mtmorgan at fhcrc.org>
>>>>>>
>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> On 08/30/2012 04:42 AM, Vincent Carey
>>>>>> wrote:
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 6:27 AM, Tim
>>>>>> Triche, Jr. <
>>>>>>
>>>>>> tim.triche at gmail.com <mailto:tim.triche at gmail.com>
>>>>>>
>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> nb. one of the reasons for the
>>>>>> existence of the MergedDataSet
>>>>>>
>>>>>> class in
>>>>>>
>>>>>> regulatoR (to be submitted for
>>>>>> review shortly) is that, while SEs
>>>>>>
>>>>>> are
>>>>>>
>>>>>> absolutely fantastic for managing
>>>>>> data matrices that are stapled to
>>>>>>
>>>>>> a
>>>>>>
>>>>>> GRanges, what is less awesome is
>>>>>> having a relatively light-weight
>>>>>> DataFrame
>>>>>> for phenotypic data that requires
>>>>>> the entire memory footprint be
>>>>>> recreated
>>>>>> upon writing a new column into
>>>>>> said
>>>>>> DataFrame.
>>>>>>
>>>>>> If R5 classes didn't spook me a
>>>>>> little, I would already have done
>>>>>> something
>>>>>>
>>>>>>
>>>>>> We don't often use this R5 terminology
>>>>>> but I see Hadley has made an
>>>>>> accessible document referring to
>>>>>> reference classes in this way.
>>>>>>
>>>>>>
>>>>>> To me the challenge is more conceptual --
>>>>>> pass-by-reference and the
>>>>>>
>>>>>> way
>>>>>>
>>>>>> that two variables pointing to the
>>>>>> instance
>>>>>> are updated at the same
>>>>>>
>>>>>> time --
>>>>>>
>>>>>> and I had been thinking of a
>>>>>> LockedEnvironment-style implementation
>>>>>>
>>>>>> where
>>>>>>
>>>>>> some operations were free ('copying') but
>>>>>> others weren't (subset,
>>>>>>
>>>>>> subset
>>>>>>
>>>>>> assign). But maybe there are some more
>>>>>> direct approaches...
>>>>>>
>>>>>>
>>>>>> My 2c: This is a situation where some
>>>>>> experimental data would be
>>>>>>
>>>>>> helpful.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Perhaps Tim or Kasper can share a largish
>>>>>> (unpublished) dataset and a
>>>>>> typical workflow with the Seattle folks. Even
>>>>>> that level of detail
>>>>>>
>>>>>> would
>>>>>>
>>>>>> give some sense of the scope of the problem.
>>>>>>
>>>>>>
>>>>>>
>>>>>> For clarification, I did not mean large data,
>>>>>> although that would be
>>>>>> welcome. I meant data on computational
>>>>>> experiments
>>>>>> with different
>>>>>> approaches -- that is, performance statistics that
>>>>>> would indicate where
>>>>>> scalability fails, what steps have been taken to
>>>>>> recover it, and how
>>>>>>
>>>>>> costly
>>>>>>
>>>>>> those steps are in terms of complexity,
>>>>>> reliability,
>>>>>> maintainability,
>>>>>>
>>>>>> risk
>>>>>>
>>>>>> at the user/application end.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>>
>>>>>> Yes, for instance where in the interactive
>>>>>> use is time being spent? Is
>>>>>> it copying the assays, or validity, or
>>>>>> actually updating the row
>>>>>>
>>>>>> data? Is
>>>>>>
>>>>>> 500000 x 800 an appropriate scale to be
>>>>>> thinking about?
>>>>>>
>>>>>>
>>>>>> The main avenues for a developer seem
>>>>>> to be a) use environments or
>>>>>>
>>>>>> reference classes; there are some
>>>>>> costs
>>>>>> and we should understand
>>>>>>
>>>>>> them,
>>>>>>
>>>>>> and
>>>>>> b) use an out-of-memory approach like
>>>>>> rhdf5 or ff. Again there will
>>>>>>
>>>>>> be
>>>>>>
>>>>>> some costs. It should be relatively
>>>>>> easy to experiment with these.
>>>>>>
>>>>>> One
>>>>>>
>>>>>> thing I just learned about is
>>>>>> setValidity2 and disableValidity
>>>>>>
>>>>>> (defined
>>>>>>
>>>>>> in
>>>>>> IRanges IIRC) ... these allow you to
>>>>>> construct certain variations on
>>>>>> SummarizedExperiment with less
>>>>>> attention
>>>>>> to deeper infrastructure.
>>>>>>
>>>>>>
>>>>>> probably I can make better use of the
>>>>>> insights the IRanges guys have
>>>>>>
>>>>>> had
>>>>>>
>>>>>> in their careful development and
>>>>>> application
>>>>>> of validity methods,
>>>>>>
>>>>>> though I
>>>>>>
>>>>>> feel a bit like these are 'attractive
>>>>>> hazards' that tempt us to do
>>>>>>
>>>>>> unsafe
>>>>>>
>>>>>> things and then pay the price later. This
>>>>>> is
>>>>>> likely the first
>>>>>>
>>>>>> direction
>>>>>>
>>>>>> I'll explore.
>>>>>>
>>>>>> Exploring a little I already see that
>>>>>> there
>>>>>> are some pretty dumb
>>>>>>
>>>>>> things
>>>>>>
>>>>>> being done in assignment.
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>
>>>>>> whereby the assays/libraries for a
>>>>>> given
>>>>>> study subject are all
>>>>>>
>>>>>> pointed
>>>>>>
>>>>>> to
>>>>>> as SEs (i.e. RNAseq, BSseq,
>>>>>> expression/methylation arrays,
>>>>>> CNV/SNP
>>>>>> arrays,
>>>>>> WGS or exomic DNAseq) and the
>>>>>> column
>>>>>> (phenotype) data can avoid
>>>>>>
>>>>>> being
>>>>>>
>>>>>> subject to these constraints.
>>>>>> Truth
>>>>>> be told I *still* want to do
>>>>>>
>>>>>> that
>>>>>>
>>>>>> because, most of the time, updates
>>>>>> to the latter are independent of,
>>>>>> and
>>>>>> happen subsequently to, loading
>>>>>> the
>>>>>> former.
>>>>>>
>>>>>> Suggestions would be welcome,
>>>>>> because other than these minor
>>>>>>
>>>>>> niggles,
>>>>>>
>>>>>> the
>>>>>> SummarizedExperiment class is
>>>>>> almost
>>>>>> perfect for many tasks.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 29, 2012 at 9:57 PM,
>>>>>> Tim
>>>>>> Triche, Jr. <
>>>>>>
>>>>>> tim.triche at gmail.com <mailto:tim.triche at gmail.com>
>>>>>>
>>>>>>
>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> assigning new colData columns,
>>>>>> or
>>>>>> overwriting old ones, in a
>>>>>>
>>>>>> sizable
>>>>>>
>>>>>> (say
>>>>>> 500000 row x 800 column) SE is
>>>>>> nauseatingly slow.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> There has to be a better way
>>>>>> --
>>>>>> I'm willing to write it if
>>>>>> someone
>>>>>>
>>>>>> can
>>>>>>
>>>>>> point out an obvious way to do
>>>>>> it
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 29, 2012 at 9:52
>>>>>> PM,
>>>>>> Kasper Daniel Hansen <
>>>>>> kasperdanielhansen at gmail.com
>>>>>>
>>>>>> <mailto:kasperdanielhansen at gmail.com>>
>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at
>>>>>> 12:44
>>>>>> AM, Martin Morgan <
>>>>>>
>>>>>> mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
>>>>>>
>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>> On 08/29/2012 06:46
>>>>>> PM,
>>>>>> Kasper Daniel Hansen
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> There is a lot of
>>>>>> good stuff to say
>>>>>> about
>>>>>>
>>>>>> SummarizedExperiments,
>>>>>> and
>>>>>> from a certain
>>>>>> point
>>>>>> of view I have a
>>>>>> parallel
>>>>>> implementation in
>>>>>>
>>>>>> bsseq
>>>>>>
>>>>>>
>>>>>> (and there is also one in
>>>>>> genoset).
>>>>>>
>>>>>>
>>>>>> However, I really
>>>>>> like having the
>>>>>> assayData inside
>>>>>> an
>>>>>>
>>>>>> environment.
>>>>>>
>>>>>> This helps some on
>>>>>> memory and -
>>>>>> equally
>>>>>> important - speed
>>>>>> at
>>>>>> the
>>>>>> command line. I
>>>>>> certainly need to
>>>>>> very heavily
>>>>>> consider using
>>>>>>
>>>>>> an
>>>>>>
>>>>>> environment in
>>>>>> bsseq.
>>>>>>
>>>>>> After some
>>>>>> discussion with
>>>>>> Tim
>>>>>> (Triche) we have
>>>>>> agreed that
>>>>>> something
>>>>>> like
>>>>>>
>>>>>> SummarizedExperiments
>>>>>> is
>>>>>> the way to go at
>>>>>> least for the
>>>>>> methylation
>>>>>> arrays.
>>>>>> We need to be
>>>>>> able
>>>>>> to easily handle
>>>>>> 1000s
>>>>>>
>>>>>> of
>>>>>>
>>>>>> samples.
>>>>>>
>>>>>> What is the chance
>>>>>> that we can get
>>>>>> the
>>>>>> option of having
>>>>>> the
>>>>>> assayData
>>>>>> inside an
>>>>>> environment,
>>>>>> perhaps
>>>>>> by
>>>>>> Making a
>>>>>> class
>>>>>> that is an
>>>>>> environment and
>>>>>> inherits from
>>>>>>
>>>>>> SimpleList.
>>>>>>
>>>>>>
>>>>>> Using a classUnion
>>>>>> between the existing class of
>>>>>> the assayData
>>>>>>
>>>>>> and
>>>>>> an environment.
>>>>>> Third option
>>>>>> that is probably
>>>>>> better than the
>>>>>> proceeding
>>>>>>
>>>>>> two,
>>>>>>
>>>>>> but
>>>>>> which I cannot
>>>>>> come
>>>>>> up with right now.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Probably something can
>>>>>> /
>>>>>> will be done. I guess
>>>>>> the slowness
>>>>>>
>>>>>> you're
>>>>>>
>>>>>>
>>>>>> talking
>>>>>>
>>>>>> about is when rowData
>>>>>> /
>>>>>> colData columns are
>>>>>> manipulated; any
>>>>>>
>>>>>> kind of
>>>>>>
>>>>>> subsetting would mean
>>>>>> a
>>>>>> 'deep' copy. Martin
>>>>>>
>>>>>>
>>>>>> Yes, for example
>>>>>> manipulating colData -
>>>>>> something that
>>>>>>
>>>>>> conceptually
>>>>>>
>>>>>> should be quick and easy.
>>>>>> Of course, this will not
>>>>>> affect any
>>>>>>
>>>>>> real
>>>>>>
>>>>>> computation on the
>>>>>> assayData
>>>>>> matrices, but it will make
>>>>>> life at
>>>>>>
>>>>>> the
>>>>>>
>>>>>> command prompt more
>>>>>> pleasant.
>>>>>>
>>>>>> Kasper
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> This would - in my
>>>>>> opinion - be very
>>>>>> nice and
>>>>>> worthwhile.
>>>>>>
>>>>>> Kasper
>>>>>>
>>>>>>
>>>>>> ________________________________**_________________
>>>>>>
>>>>>> Bioc-devel at r-project.org
>>>>>>
>>>>>> <mailto:Bioc-devel at r-project.org>
>>>>>> mailing list
>>>>>>
>>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>>
>>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>>
>>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>>
>>>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Computational Biology
>>>>>> /
>>>>>> Fred Hutchinson Cancer
>>>>>> Research Center
>>>>>> 1100 Fairview Ave. N.
>>>>>> PO Box 19024 Seattle,
>>>>>> WA
>>>>>> 98109
>>>>>>
>>>>>> Location: Arnold
>>>>>> Building M1 B861
>>>>>> Phone: (206) 667-2793
>>>>>>
>>>>>> <tel:%28206%29%20667-2793>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ________________________________**_________________
>>>>>> Bioc-devel at r-project.org
>>>>>>
>>>>>> <mailto:Bioc-devel at r-project.org>
>>>>>> mailing list
>>>>>>
>>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>>
>>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>>
>>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>>
>>>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *A model is a lie that helps
>>>>>> you
>>>>>> see the truth.*
>>>>>> *
>>>>>> *
>>>>>> Howard Skipper<
>>>>>>
>>>>>>
>>>>>> http://cancerres.aacrjournals.__**org/content/31/9/1173.full.__pdf<
>>>>>>
>>>>>>
>>>>>> http://cancerres.aacrjournals.__org/content/31/9/1173.full.pdf
>>>>>>
>>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>__>
>>>>>>
>>>>>>
>>>>>> **>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *A model is a lie that helps you
>>>>>> see
>>>>>> the truth.*
>>>>>> *
>>>>>> *
>>>>>> Howard Skipper<
>>>>>>
>>>>>> http://cancerres.aacrjournals.__**org/content/31/9/1173.full.__pdf<
>>>>>>
>>>>>>
>>>>>> http://cancerres.aacrjournals.__org/content/31/9/1173.full.pdf
>>>>>>
>>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>__>
>>>>>>
>>>>>>
>>>>>> **>
>>>>>>
>>>>>> [[alternative HTML
>>>>>> version deleted]]
>>>>>>
>>>>>>
>>>>>> ________________________________**_________________
>>>>>> Bioc-devel at r-project.org
>>>>>> <mailto:Bioc-devel at r-project.org>
>>>>>> mailing list
>>>>>>
>>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>>
>>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>>
>>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>>
>>>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> [[alternative HTML version
>>>>>> deleted]]
>>>>>>
>>>>>>
>>>>>> ________________________________**_________________
>>>>>> Bioc-devel at r-project.org
>>>>>> <mailto:Bioc-devel at r-project.org>
>>>>>> mailing list
>>>>>>
>>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>>
>>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>>
>>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>>
>>>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Computational Biology / Fred Hutchinson
>>>>>> Cancer Research Center
>>>>>> 1100 Fairview Ave. N.
>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>
>>>>>> Location: Arnold Building M1 B861
>>>>>> Phone: (206) 667-2793
>>>>>> <tel:%28206%29%20667-2793>
>>>>>>
>>>>>>
>>>>>> ________________________________**_________________
>>>>>> Bioc-devel at r-project.org
>>>>>> <mailto:Bioc-devel at r-project.org> mailing
>>>>>> list
>>>>>>
>>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>>
>>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>>
>>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>>
>>>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *A model is a lie that helps you see the truth.*
>>>>>> *
>>>>>> *
>>>>>> Howard Skipper<
>>>>>>
>>>>>>
>>>>>> http://cancerres.aacrjournals.__org/content/31/9/1173.full.pdf
>>>>>>
>>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>__>
>>>>>>
>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>> _________________________________________________
>>>>>> Bioc-devel at r-project.org
>>>>>> <mailto:Bioc-devel at r-project.org> mailing list
>>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>>
>>>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>
>>>>>>
>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>> _________________________________________________
>>>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>>>> mailing list
>>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>>
>>>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N.
>>>>>> PO Box 19024 Seattle, WA 98109
>>>>>>
>>>>>> Location: Arnold Building M1 B861
>>>>>> Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> /A model is a lie that helps you see the truth./
>>>>>> /
>>>>>> /
>>>>>> Howard Skipper
>>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N.
>>>>> PO Box 19024 Seattle, WA 98109
>>>>>
>>>>> Location: Arnold Building M1 B861
>>>>> Phone: (206) 667-2793
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> A model is a lie that helps you see the truth.
>>>>
>>>> Howard Skipper
>>>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioc-devel
mailing list