[Bioc-devel] SummarizedExperiments
Kasper Daniel Hansen
kasperdanielhansen at gmail.com
Wed Sep 12 21:15:26 CEST 2012
On Wed, Sep 12, 2012 at 2:23 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 09/06/2012 02:26 PM, Kasper Daniel Hansen wrote:
>>
>> Alltogether, this looks great. I am still implementing/testing, but
>> seems pretty nice.
>>
>> Here is a number of convenience methods I believe should be
>> implemented. Some of them comes from eSet and some of them comes from
>> my own work in bsseq, but I have used all of them extensively in
>> various packages.
>>
>> sampleNames, featureNames, pData
>> granges, start, end, width, strand, seqnames, seqlengths, seqlevels
>> findOverlaps, subsetByOverlaps
>
>
> GenomicRanges 1.9.65 in Bioc devel makes the following operations on
> SummarizedExperiment work on the underlying rowData; I particularly like
> subsetByOverslaps() for selecting, e.g., SummarizedExperiment rows that
> overlap features of interest, and seqlevels(x, force=TRUE) = paste("chr",
> 1:5), for instance, for selecting sequences of interest. See the section
> 'GRanges compatibility (rowData access)' on ?SummarizedExperiment.
Nice list, this is useful. Now I guess I need to update bsseq.
seqselect is nice, but also destructive. In my own work I have found
it quite useful to be able to subset on a whole chromosome, without
destroying the old object, like
function(sset, "chr1")
I know this can be done by a long subsetByOverlaps, but - like
seqselect - it is nice to have a quicker way of doing it.
I don't see any mention of
sset[gr,]
Is this planned or was it decided to drop this?
One thing I have in my package that I find indispensable is combine
and (my own) combineList. The later for combining > 2 objects, which
has a lot of possibilities for speed up especially if (very common)
all the objects have the same rowData, as opposed to Reduce(combine,
LIST).. Usecase: you need to add additional samples to your
SummarizedExperiment.
My code (which may utilize that width(object) == 1 is in bsseq/R/BSseq-class.R
Kasper
>
> Martin
>
> compare, countOverlaps, coverage, disjointBins, distance, distanceToNearest,
> duplicated, end, end<-, findOverlaps, flank, follow, granges, isDisjoint,
> match, mcols, mcols<-, narrow, nearest, order, precede, ranges, ranges<-,
> rank, resize, restrict, seqinfo, seqinfo<-, seqnames, shift, sort, start,
> start<-, strand, strand<-, width, width<-.
>
>
>>
>> 'granges' is a method already defined in GenomicRanges, I have found
>> it very convenient. It is not clear to me why it is all lowercase
>>
>> What I have done in bsseq for the later two lines of methods, is to
>> have a class 'hasGRanges' that designate any object with a GRanges.
>> Then implement the granges() method and the rest goes from there. Not
>> sure whether that makes sense in this context, but in general I have
>> found it very convenient for my own work. Code for all of this is in
>> bsseq/R/hasGRanges.R. I have two classes in the package inheriting
>> from hasGRanges. Note that the classes in bsseq have not been updated
>> to build on the new SummarizedExperiment (yet).
>>
>> Kasper
>>
>>
>> On Wed, Sep 5, 2012 at 1:31 PM, Tim Triche, Jr. <tim.triche at gmail.com>
>> wrote:
>>>
>>> It seems to work with a Matrix in the assays:
>>>
>>> R> packageVersion('GenomicRanges')
>>> [1] ‘1.9.61’
>>>
>>> R> SE <- SummarizedExperiment(assays=wts, rowData=wts.locs)
>>> Error in function (classes, fdef, mtable) :
>>> unable to find an inherited method for function
>>> "SummarizedExperiment",
>>> for signature "dgCMatrix"
>>>
>>> R> SE <- SummarizedExperiment(assays=SimpleList(weights=wts),
>>> rowData=wts.locs)
>>> dimnames(.) <- NULL: translated to
>>> dimnames(.) <- list(NULL,NULL) <==> unname(.)
>>>
>>> R> show(SE)
>>> class: SummarizedExperiment
>>> dim: 35855 35855
>>> exptData(0):
>>> assays(1): weights
>>> rownames(35855): feat1 feat2 ... feat35854 feat35855
>>> rowData metadata column names(10): egid signed ... DHS_DGF chromHMM_state
>>> colnames(35855): feat1 feat2 ... feat35854 feat35855
>>> colData names(0):
>>>
>>> R> class(assays(SE, 'weights'))
>>> [1] "SimpleList"
>>> attr(,"package")
>>> [1] "IRanges"
>>>
>>> R> class(assays(SE)$weights)
>>> [1] "dgCMatrix"
>>> attr(,"package")
>>> [1] "Matrix"
>>>
>>>
>>> Very cool! Albeit a tad disorienting. I was going to patch this up and
>>> then realized, hey, if someone's going to stuff a Matrix into their
>>> assays,
>>> they ought to give it a name and know what's going on under the hood, at
>>> least for now. It seems to work correctly, too.
>>>
>>> Oh oh, I spoke too soon. If I split by chromosome, the assay falls out:
>>>
>>> R> head(assays(SE)$weights)
>>> 6 x 35855 sparse Matrix of class "dgCMatrix"
>>> [[ suppressing 82 column names ‘feat1’, ‘feat2’, ‘feat3’ ... ]]
>>>
>>> feat1 1 . . . . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . .
>>> feat2 . 1 . . . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . .
>>> feat3 . . 1.0000000 0.5737488 . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . .
>>> feat4 . . 0.5737488 1.0000000 . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . .
>>> feat5 . . . . 1.00000000 0.04300373 . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . .
>>> feat6 . . . . 0.04300373 1.00000000 . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>>> .
>>> . . . . . . . . . . . .
>>>
>>> feat1 . . . . . . . . . . . . . . ......
>>> feat2 . . . . . . . . . . . . . . ......
>>> feat3 . . . . . . . . . . . . . . ......
>>> feat4 . . . . . . . . . . . . . . ......
>>> feat5 . . . . . . . . . . . . . . ......
>>> feat6 . . . . . . . . . . . . . . ......
>>>
>>> .....suppressing columns in show(); maybe adjust 'options(max.print=
>>> *)'
>>> ..............................
>>> R> SE.by.chr <- split(SE, as.vector(seqnames(SE)))
>>> R> head(assays(SE.by.chr)$weights)
>>> Error in head(assays(SE.by.chr)$weights) :
>>> error in evaluating the argument 'x' in selecting a method for
>>> function
>>> 'head': Error in function (classes, fdef, mtable) :
>>> unable to find an inherited method for function "assays", for
>>> signature
>>> "list"
>>>
>>> Ideas? I would love to do something useful here instead of just fussing.
>>> It's really a great structure.
>>>
>>> I love the use of a reference class to reference the internal pieces of
>>> an
>>> object, it's such a clever hack. Thanks much for adding it.
>>>
>>> --t
>>>
>>>
>>>
>>> On Tue, Sep 4, 2012 at 7:58 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>>>
>>>>
>>>> On 09/04/2012 07:02 PM, Tim Triche, Jr. wrote:
>>>>>
>>>>>
>>>>> This is really cool.
>>>>>
>>>>> R> head(lsos(), 1)
>>>>> Type Size Rows Columns
>>>>> LAML.RNAseq SummarizedExperiment 38666528 23529 173
>>>>> R> LAML.RNAseq.updated <- updateObject(LAML.RNAseq)
>>>>> R> head(lsos(), 2)
>>>>> Type Size Rows Columns
>>>>> LAML.RNAseq SummarizedExperiment 38666528 23529 173
>>>>> LAML.RNAseq.updated SummarizedExperiment 6101488 23529 173
>>>>
>>>>
>>>>
>>>> I think either the sizes are misleading (missing the content of the
>>>> environment that the reference classes contain, probably
>>>> LAML.RNAseq.updated at assays$data has size 38666528 - 6101488) or the data
>>>> has
>>>> not been updated correctly :\. Nonetheless manipulation of especially
>>>> non-assay but also assay data should be faster.
>>>>
>>>> I added a brief section (1.9.60) to ?SummarizedExperiment describing
>>>> what
>>>> was done. Also, I tried (w/out actually testing...) to add support for
>>>> Matrix instances.
>>>>
>>>> As a miscellaneous 'tip', when working with large data I find it useful
>>>> to
>>>> start R with --min-vsize=2048M --min-nsize=20M. These are documented in
>>>> RShowDoc("R-intro") as 'for expert use only' and influence how much
>>>> memory R
>>>> allocates for vectors ('vsize'; vsize can be very large) and for
>>>> S-expressions ('nsize' 50000000) and help get up to large memory
>>>> allocations
>>>> without innumerable garbage collections. Of course having big memory is
>>>> of
>>>> primary importance.
>>>>
>>>> Martin
>>>>
>>>>> R> str(LAML.RNAseq.updated at assays)
>>>>> Reference class 'ShallowSimpleListAssays' [package "GenomicRanges"]
>>>>> with
>>>>> 1 fields
>>>>> $ data:Formal class 'SimpleList' [package "IRanges"] with 4 slots
>>>>> .. ..@ listData :List of 1
>>>>> .. .. ..$ rpm: num [1:23529, 1:173] 0 0 8.12 10.84 27.73 ...
>>>>> .. ..@ elementType : chr "ANY"
>>>>> .. ..@ elementMetadata: NULL
>>>>> .. ..@ metadata : list()
>>>>> and 11 methods,R>
>>>>> R> packageVersion('GenomicRanges')
>>>>> [1] ‘1.9.59’
>>>>>
>>>>> A slightly bigger one:
>>>>>
>>>>> R> head(lsos(),2)
>>>>> Type Size Rows Columns
>>>>> LAML SummarizedExperiment 1950606472 485577 192
>>>>> LAML.updated SummarizedExperiment 458912760 485577 192
>>>>> R> system.time(LAML.updated$foo <- rep('bar', 192))
>>>>> user system elapsed
>>>>> 0.132 0.116 0.248
>>>>> R> system.time(LAML$foo <- rep('bar', 192))
>>>>> user system elapsed
>>>>> 2.728 2.144 7.519
>>>>>
>>>>>
>>>>> That is a really clever hack you added. I think I will have to figure
>>>>> out how it works :-)
>>>>>
>>>>> Thanks!!!
>>>>>
>>>>>
>>>>> On Sun, Sep 2, 2012 at 2:47 PM, Martin Morgan <mtmorgan at fhcrc.org
>>>>> <mailto:mtmorgan at fhcrc.org>> wrote:
>>>>>
>>>>> On 08/30/2012 08:47 AM, Vincent Carey wrote:
>>>>>
>>>>> I am all in favor of the dialogue and the aim. I reproduced
>>>>> your first
>>>>> timing, and note
>>>>>
>>>>> unix.time(ss2 at colData$tx <- 1:1000)
>>>>>
>>>>> user system elapsed
>>>>> 5.963 4.520 10.483
>>>>>
>>>>> unix.time(colData(ss2)$tx2 <- 1:1000)
>>>>>
>>>>> user system elapsed
>>>>> 17.937 13.074 31.016
>>>>>
>>>>> BTW did you really mean to put the meth/unmeth in exptData, or
>>>>> should it be
>>>>> in assays?
>>>>>
>>>>> From an experimentation perspective, I would want to redo
>>>>> these
>>>>>
>>>>> computations above with an environment holding the assay data
>>>>> and see how
>>>>> it changes. I suppose one can infer from first principles
>>>>> that
>>>>>
>>>>> ss4 at assays[[1]]$meth[1:4,1:4]
>>>>>
>>>>> [,1] [,2] [,3] [,4]
>>>>> [1,] 0.2 0.2 0.2 0.2
>>>>> [2,] 0.2 0.2 0.2 0.2
>>>>> [3,] 0.2 0.2 0.2 0.2
>>>>> [4,] 0.2 0.2 0.2 0.2
>>>>>
>>>>> ss4 at assays[[2]]$unmeth[1:4,1:__4]
>>>>>
>>>>>
>>>>> [,1] [,2] [,3] [,4]
>>>>> [1,] 0.2 0.2 0.2 0.2
>>>>> [2,] 0.2 0.2 0.2 0.2
>>>>> [3,] 0.2 0.2 0.2 0.2
>>>>> [4,] 0.2 0.2 0.2 0.2
>>>>>
>>>>> unix.time(colData(ss4)$tx2 <- 1:1000)
>>>>>
>>>>> user system elapsed
>>>>> 0.010 0.002 0.012
>>>>>
>>>>> where ss4 has a SimpleList of environments holding the assay
>>>>> data.
>>>>>
>>>>> ss4
>>>>>
>>>>> class: SummarizedExperiment
>>>>> dim: 450000 1000
>>>>> exptData(0):
>>>>> assays(2): '' ''
>>>>> rownames: NULL
>>>>> rowData values names(0):
>>>>> colnames: NULL
>>>>> colData names(4): sampleID treatment tx tx2
>>>>>
>>>>> ss4 at assays[[1]]
>>>>>
>>>>> <environment: 0x1f4e0ba0>
>>>>>
>>>>> This can't be attempted without disabling validity checking,
>>>>> and
>>>>> the stack
>>>>> trace from the attempt is instructive.
>>>>>
>>>>> I don't think it will be too difficult to establish a
>>>>> customized
>>>>> SummarizedExperiment that permits this, using the
>>>>> lockedEnvironment
>>>>> approach of ExpressionSets. I also do not know the down sides
>>>>> but it seems
>>>>> you/we could get some mileage on the approach and see.
>>>>>
>>>>>
>>>>> GenomicRanges v. 1.9.59 has some preliminary changes that address
>>>>> the performance issue. Serialized (saved) objects 'x' need to be
>>>>> updated with updateObject(x) (and be careful if these are your
>>>>> only
>>>>> instances of some hard work). There are some loose ends to tidy
>>>>> up.
>>>>>
>>>>> Martin
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Aug 30, 2012 at 10:53 AM, Kasper Daniel Hansen <
>>>>> kasperdanielhansen at gmail.com
>>>>> <mailto:kasperdanielhansen at gmail.com>> wrote:
>>>>>
>>>>> My perspective (and the reason for me email) is pretty
>>>>> simple.
>>>>>
>>>>> First off, let me say that we will always be working on
>>>>> trading off
>>>>> memory and IO and development time and all the other
>>>>> stuff.
>>>>> I mean,
>>>>> clearly the environment stuff for ExpressionSets did not
>>>>> help all
>>>>> people, because we have xps, aroma.affymetrix and oligo
>>>>> all
>>>>> using some
>>>>> kind of file based backend.
>>>>>
>>>>> I am not trying to solve something "big". Putting
>>>>> methylation data
>>>>> inside an environment is not suddenly going to make my
>>>>> algorithms go
>>>>> _significantly_ faster or the memory usage to go
>>>>> _significantly_ down.
>>>>> But it makes life faster when you explore data - I
>>>>> believe.
>>>>>
>>>>> Now for two realistic examples (one minfi, one bsseq) that
>>>>> I
>>>>> encounter
>>>>> all the time:
>>>>> library(GenomicRanges)
>>>>> nRows <- 450000
>>>>> nCols <- 1000
>>>>> SSet <- SummarizedExperiment(rowData = GRanges(seqnames =
>>>>> rep("chr1",
>>>>> nRows),
>>>>> ranges = IRanges(1:nRows,
>>>>> width = 1)),
>>>>> colData =
>>>>> DataFrame(sampleID =
>>>>> paste0("person", 1:nCols)),
>>>>> exptData = SimpleList(meth
>>>>> =
>>>>> matrix(0.2,
>>>>> ncol = nCols, nrow = nRows),
>>>>> unmeth = matrix(0.2, ncol =
>>>>> nCols, nrow =
>>>>> nRows)))
>>>>> print(object.size(SSet), units = "auto") # 6.7GB
>>>>> system.time({
>>>>> colData(SSet)$treatment = rep(c("a", "b"), nCols /
>>>>> 2)
>>>>> }) # around 26 sec
>>>>>
>>>>>
>>>>> nRows <- 28000000
>>>>> nCols <- 10
>>>>> SSet <- SummarizedExperiment(rowData = GRanges(seqnames =
>>>>> rep("chr1",
>>>>> nRows),
>>>>> ranges = IRanges(1:nRows,
>>>>> width = 1)),
>>>>> colData =
>>>>> DataFrame(sampleID =
>>>>> paste0("person", 1:nCols)),
>>>>> exptData = SimpleList(meth
>>>>> =
>>>>> matrix(0.2,
>>>>> ncol = nCols, nrow = nRows),
>>>>> unmeth = matrix(0.2, ncol =
>>>>> nCols, nrow =
>>>>> nRows)))
>>>>> print(object.size(SSet), units = "auto") # 4.4GB
>>>>> system.time({
>>>>> colData(SSet)$treatment = rep(c("a", "b"), nCols /
>>>>> 2)
>>>>> }) # around 20 sec
>>>>>
>>>>> This is of course not that slow, especially not when
>>>>> compared to the
>>>>> massive amount of computation I had to do in order to
>>>>> arrive
>>>>> at the
>>>>> SSet in the first place. But the difference is that I
>>>>> staring at the
>>>>> screen when I work interactively, but I am out partying
>>>>> when
>>>>> I run my
>>>>> computations (overnight). And I do stuff like the above
>>>>> surprisingly
>>>>> many times after I have generated the objects.
>>>>>
>>>>> My perspective is that this was essentially "solved" with
>>>>> the
>>>>> ExpressionSet class, and I feel we are regressing a bit.
>>>>> I
>>>>> also
>>>>> believe that it should be "easy" to re-use some of the
>>>>> ideas/code from
>>>>> ExpressionSet and that it is worthwhile to do so, for
>>>>> something that
>>>>> may very well be an extremely important core class.
>>>>> SummarizedExperiment is such a nice class that I
>>>>> essentially
>>>>> have a
>>>>> parallel implementation in bsseq and Pete has one in
>>>>> genoset.
>>>>>
>>>>> So essentially, I was interested in fixing a pretty small
>>>>> problem, but
>>>>> something that I am a bit irritated about. And I feel we
>>>>> almost have
>>>>> a ready to use solution.
>>>>>
>>>>> The reason I ask about it here, and not just do it myself,
>>>>> is twofold
>>>>> (1) it is tricky to get the environment right, and I do
>>>>> not
>>>>> have
>>>>> enough experience on the environment part of
>>>>> ExpressionSet,
>>>>> that I
>>>>> feel comfortable doing it
>>>>> (2) I really think it would be beneficial to the
>>>>> community,
>>>>> and
>>>>> experience tells us, that what happens in these core
>>>>> infrastructure
>>>>> packages are much more likely to be widely used.
>>>>>
>>>>> While I think that it may be worthwhile for people to play
>>>>> with
>>>>> reference classes, I think that going down that route is
>>>>> unexplored
>>>>> and may or may not be worth it. There is something to be
>>>>> said for
>>>>> building on something we have experience with. A
>>>>> reference
>>>>> class idea
>>>>> can be developed in parallel, if people are willing to
>>>>> commit the
>>>>> time.
>>>>>
>>>>> Kasper
>>>>>
>>>>>
>>>>> On Thu, Aug 30, 2012 at 10:12 AM, Tim Triche, Jr.
>>>>> <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
>>>>>
>>>>> wrote:
>>>>>
>>>>> I am interested in how best to benchmark such
>>>>> operations (and another
>>>>> thought that I recently had, "why can't I put a Doug
>>>>> Bates style 'Matrix'
>>>>> in the SimpleList of assay data)? Kasper is the
>>>>> person
>>>>> who convinced me
>>>>> that ff, rhdf5, etc.should be a last resort, since RAM
>>>>> is cheap.
>>>>>
>>>>> For large data, the TCGA BRCA samples are a hair under
>>>>> 800 (last I
>>>>>
>>>>> looked)
>>>>>
>>>>> 450k arrays with matching RNAseq and DNAseq, and
>>>>> that's
>>>>> not even a big
>>>>> study by breast cancer standards (IIRC, Gordon Mills
>>>>> was
>>>>> rounding up 5000
>>>>> tumors) Both minfi and methylumi can work off of the
>>>>> raw
>>>>> data in these
>>>>> studies as a "practical" example, but when it really
>>>>> gets interesting is
>>>>> when 3-4 ragged assays exist for each patient.
>>>>>
>>>>> Exon-wise (or splice-graph-wise) RNAseq data with
>>>>> splice
>>>>> junction read
>>>>> counts is another place to find big SE-centric data
>>>>> with
>>>>> ambiguities.
>>>>>
>>>>> One
>>>>>
>>>>> of the reasons I keep coming back to the idea of
>>>>> subsetting an SE by a
>>>>> GRanges[List], e.g. SE.totalRNA[ UCSC.lincRNAs, ], is
>>>>> its speed. In any
>>>>> event, after my obligations for the day are
>>>>> dispatched,
>>>>> I'll write up an
>>>>> example of doing this off of TCGA data, perhaps from
>>>>> colorectal (only
>>>>>
>>>>> about
>>>>>
>>>>> 400 subjects but at least it is published already).
>>>>> You
>>>>> are right, a
>>>>> single use case is often worth 1000 specifications.
>>>>>
>>>>> --t
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Aug 30, 2012 at 6:42 AM, Vincent Carey
>>>>> <stvjc at channing.harvard.edu
>>>>> <mailto:stvjc at channing.harvard.edu>>__wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Aug 30, 2012 at 9:16 AM, Sean Davis
>>>>> <sdavis2 at mail.nih.gov
>>>>> <mailto:sdavis2 at mail.nih.gov>>
>>>>>
>>>>>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Aug 30, 2012 at 8:59 AM, Martin Morgan
>>>>> <mtmorgan at fhcrc.org
>>>>> <mailto:mtmorgan at fhcrc.org>
>>>>>
>>>>>
>>>>> wrote:
>>>>>
>>>>>
>>>>> On 08/30/2012 04:42 AM, Vincent Carey
>>>>> wrote:
>>>>>
>>>>> On Thu, Aug 30, 2012 at 6:27 AM, Tim
>>>>> Triche, Jr. <
>>>>>
>>>>> tim.triche at gmail.com <mailto:tim.triche at gmail.com>
>>>>>
>>>>>
>>>>> wrote:
>>>>>
>>>>>
>>>>> nb. one of the reasons for the
>>>>> existence of the MergedDataSet
>>>>>
>>>>> class in
>>>>>
>>>>> regulatoR (to be submitted for
>>>>> review shortly) is that, while SEs
>>>>>
>>>>> are
>>>>>
>>>>> absolutely fantastic for managing
>>>>> data matrices that are stapled to
>>>>>
>>>>> a
>>>>>
>>>>> GRanges, what is less awesome is
>>>>> having a relatively light-weight
>>>>> DataFrame
>>>>> for phenotypic data that requires
>>>>> the entire memory footprint be
>>>>> recreated
>>>>> upon writing a new column into
>>>>> said
>>>>> DataFrame.
>>>>>
>>>>> If R5 classes didn't spook me a
>>>>> little, I would already have done
>>>>> something
>>>>>
>>>>>
>>>>> We don't often use this R5 terminology
>>>>> but I see Hadley has made an
>>>>> accessible document referring to
>>>>> reference classes in this way.
>>>>>
>>>>>
>>>>> To me the challenge is more conceptual --
>>>>> pass-by-reference and the
>>>>>
>>>>> way
>>>>>
>>>>> that two variables pointing to the
>>>>> instance
>>>>> are updated at the same
>>>>>
>>>>> time --
>>>>>
>>>>> and I had been thinking of a
>>>>> LockedEnvironment-style implementation
>>>>>
>>>>> where
>>>>>
>>>>> some operations were free ('copying') but
>>>>> others weren't (subset,
>>>>>
>>>>> subset
>>>>>
>>>>> assign). But maybe there are some more
>>>>> direct approaches...
>>>>>
>>>>>
>>>>> My 2c: This is a situation where some
>>>>> experimental data would be
>>>>>
>>>>> helpful.
>>>>>
>>>>>
>>>>>
>>>>> Perhaps Tim or Kasper can share a largish
>>>>> (unpublished) dataset and a
>>>>> typical workflow with the Seattle folks. Even
>>>>> that level of detail
>>>>>
>>>>> would
>>>>>
>>>>> give some sense of the scope of the problem.
>>>>>
>>>>>
>>>>>
>>>>> For clarification, I did not mean large data,
>>>>> although that would be
>>>>> welcome. I meant data on computational
>>>>> experiments
>>>>> with different
>>>>> approaches -- that is, performance statistics that
>>>>> would indicate where
>>>>> scalability fails, what steps have been taken to
>>>>> recover it, and how
>>>>>
>>>>> costly
>>>>>
>>>>> those steps are in terms of complexity,
>>>>> reliability,
>>>>> maintainability,
>>>>>
>>>>> risk
>>>>>
>>>>> at the user/application end.
>>>>>
>>>>>
>>>>>
>>>>> Sean
>>>>>
>>>>>
>>>>> Yes, for instance where in the interactive
>>>>> use is time being spent? Is
>>>>> it copying the assays, or validity, or
>>>>> actually updating the row
>>>>>
>>>>> data? Is
>>>>>
>>>>> 500000 x 800 an appropriate scale to be
>>>>> thinking about?
>>>>>
>>>>>
>>>>> The main avenues for a developer seem
>>>>> to be a) use environments or
>>>>>
>>>>> reference classes; there are some
>>>>> costs
>>>>> and we should understand
>>>>>
>>>>> them,
>>>>>
>>>>> and
>>>>> b) use an out-of-memory approach like
>>>>> rhdf5 or ff. Again there will
>>>>>
>>>>> be
>>>>>
>>>>> some costs. It should be relatively
>>>>> easy to experiment with these.
>>>>>
>>>>> One
>>>>>
>>>>> thing I just learned about is
>>>>> setValidity2 and disableValidity
>>>>>
>>>>> (defined
>>>>>
>>>>> in
>>>>> IRanges IIRC) ... these allow you to
>>>>> construct certain variations on
>>>>> SummarizedExperiment with less
>>>>> attention
>>>>> to deeper infrastructure.
>>>>>
>>>>>
>>>>> probably I can make better use of the
>>>>> insights the IRanges guys have
>>>>>
>>>>> had
>>>>>
>>>>> in their careful development and
>>>>> application
>>>>> of validity methods,
>>>>>
>>>>> though I
>>>>>
>>>>> feel a bit like these are 'attractive
>>>>> hazards' that tempt us to do
>>>>>
>>>>> unsafe
>>>>>
>>>>> things and then pay the price later. This
>>>>> is
>>>>> likely the first
>>>>>
>>>>> direction
>>>>>
>>>>> I'll explore.
>>>>>
>>>>> Exploring a little I already see that
>>>>> there
>>>>> are some pretty dumb
>>>>>
>>>>> things
>>>>>
>>>>> being done in assignment.
>>>>>
>>>>> Martin
>>>>>
>>>>>
>>>>> whereby the assays/libraries for a
>>>>> given
>>>>> study subject are all
>>>>>
>>>>> pointed
>>>>>
>>>>> to
>>>>> as SEs (i.e. RNAseq, BSseq,
>>>>> expression/methylation arrays,
>>>>> CNV/SNP
>>>>> arrays,
>>>>> WGS or exomic DNAseq) and the
>>>>> column
>>>>> (phenotype) data can avoid
>>>>>
>>>>> being
>>>>>
>>>>> subject to these constraints.
>>>>> Truth
>>>>> be told I *still* want to do
>>>>>
>>>>> that
>>>>>
>>>>> because, most of the time, updates
>>>>> to the latter are independent of,
>>>>> and
>>>>> happen subsequently to, loading
>>>>> the
>>>>> former.
>>>>>
>>>>> Suggestions would be welcome,
>>>>> because other than these minor
>>>>>
>>>>> niggles,
>>>>>
>>>>> the
>>>>> SummarizedExperiment class is
>>>>> almost
>>>>> perfect for many tasks.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 29, 2012 at 9:57 PM,
>>>>> Tim
>>>>> Triche, Jr. <
>>>>>
>>>>> tim.triche at gmail.com <mailto:tim.triche at gmail.com>
>>>>>
>>>>>
>>>>>
>>>>> wrote:
>>>>>
>>>>>
>>>>> assigning new colData columns,
>>>>> or
>>>>> overwriting old ones, in a
>>>>>
>>>>> sizable
>>>>>
>>>>> (say
>>>>> 500000 row x 800 column) SE is
>>>>> nauseatingly slow.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> There has to be a better way
>>>>> --
>>>>> I'm willing to write it if
>>>>> someone
>>>>>
>>>>> can
>>>>>
>>>>> point out an obvious way to do
>>>>> it
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 29, 2012 at 9:52
>>>>> PM,
>>>>> Kasper Daniel Hansen <
>>>>> kasperdanielhansen at gmail.com
>>>>>
>>>>> <mailto:kasperdanielhansen at gmail.com>>
>>>>>
>>>>> wrote:
>>>>>
>>>>> On Thu, Aug 30, 2012 at
>>>>> 12:44
>>>>> AM, Martin Morgan <
>>>>>
>>>>> mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
>>>>>
>>>>>
>>>>> wrote:
>>>>>
>>>>> On 08/29/2012 06:46
>>>>> PM,
>>>>> Kasper Daniel Hansen
>>>>> wrote:
>>>>>
>>>>>
>>>>> There is a lot of
>>>>> good stuff to say
>>>>> about
>>>>>
>>>>> SummarizedExperiments,
>>>>> and
>>>>> from a certain
>>>>> point
>>>>> of view I have a
>>>>> parallel
>>>>> implementation in
>>>>>
>>>>> bsseq
>>>>>
>>>>>
>>>>> (and there is also one in
>>>>> genoset).
>>>>>
>>>>>
>>>>> However, I really
>>>>> like having the
>>>>> assayData inside
>>>>> an
>>>>>
>>>>> environment.
>>>>>
>>>>> This helps some on
>>>>> memory and -
>>>>> equally
>>>>> important - speed
>>>>> at
>>>>> the
>>>>> command line. I
>>>>> certainly need to
>>>>> very heavily
>>>>> consider using
>>>>>
>>>>> an
>>>>>
>>>>> environment in
>>>>> bsseq.
>>>>>
>>>>> After some
>>>>> discussion with
>>>>> Tim
>>>>> (Triche) we have
>>>>> agreed that
>>>>> something
>>>>> like
>>>>>
>>>>> SummarizedExperiments
>>>>> is
>>>>> the way to go at
>>>>> least for the
>>>>> methylation
>>>>> arrays.
>>>>> We need to be
>>>>> able
>>>>> to easily handle
>>>>> 1000s
>>>>>
>>>>> of
>>>>>
>>>>> samples.
>>>>>
>>>>> What is the chance
>>>>> that we can get
>>>>> the
>>>>> option of having
>>>>> the
>>>>> assayData
>>>>> inside an
>>>>> environment,
>>>>> perhaps
>>>>> by
>>>>> Making a
>>>>> class
>>>>> that is an
>>>>> environment and
>>>>> inherits from
>>>>>
>>>>> SimpleList.
>>>>>
>>>>>
>>>>> Using a classUnion
>>>>> between the existing class of
>>>>> the assayData
>>>>>
>>>>> and
>>>>> an environment.
>>>>> Third option
>>>>> that is probably
>>>>> better than the
>>>>> proceeding
>>>>>
>>>>> two,
>>>>>
>>>>> but
>>>>> which I cannot
>>>>> come
>>>>> up with right now.
>>>>>
>>>>>
>>>>>
>>>>> Probably something can
>>>>> /
>>>>> will be done. I guess
>>>>> the slowness
>>>>>
>>>>> you're
>>>>>
>>>>>
>>>>> talking
>>>>>
>>>>> about is when rowData
>>>>> /
>>>>> colData columns are
>>>>> manipulated; any
>>>>>
>>>>> kind of
>>>>>
>>>>> subsetting would mean
>>>>> a
>>>>> 'deep' copy. Martin
>>>>>
>>>>>
>>>>> Yes, for example
>>>>> manipulating colData -
>>>>> something that
>>>>>
>>>>> conceptually
>>>>>
>>>>> should be quick and easy.
>>>>> Of course, this will not
>>>>> affect any
>>>>>
>>>>> real
>>>>>
>>>>> computation on the
>>>>> assayData
>>>>> matrices, but it will make
>>>>> life at
>>>>>
>>>>> the
>>>>>
>>>>> command prompt more
>>>>> pleasant.
>>>>>
>>>>> Kasper
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> This would - in my
>>>>> opinion - be very
>>>>> nice and
>>>>> worthwhile.
>>>>>
>>>>> Kasper
>>>>>
>>>>>
>>>>> ________________________________**_________________
>>>>>
>>>>> Bioc-devel at r-project.org
>>>>>
>>>>> <mailto:Bioc-devel at r-project.org>
>>>>> mailing list
>>>>>
>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>
>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Computational Biology
>>>>> /
>>>>> Fred Hutchinson Cancer
>>>>> Research Center
>>>>> 1100 Fairview Ave. N.
>>>>> PO Box 19024 Seattle,
>>>>> WA
>>>>> 98109
>>>>>
>>>>> Location: Arnold
>>>>> Building M1 B861
>>>>> Phone: (206) 667-2793
>>>>>
>>>>> <tel:%28206%29%20667-2793>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________**_________________
>>>>> Bioc-devel at r-project.org
>>>>>
>>>>> <mailto:Bioc-devel at r-project.org>
>>>>> mailing list
>>>>>
>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>
>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *A model is a lie that helps
>>>>> you
>>>>> see the truth.*
>>>>> *
>>>>> *
>>>>> Howard Skipper<
>>>>>
>>>>>
>>>>> http://cancerres.aacrjournals.__**org/content/31/9/1173.full.__pdf<
>>>>>
>>>>>
>>>>> http://cancerres.aacrjournals.__org/content/31/9/1173.full.pdf
>>>>>
>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>__>
>>>>>
>>>>>
>>>>> **>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *A model is a lie that helps you
>>>>> see
>>>>> the truth.*
>>>>> *
>>>>> *
>>>>> Howard Skipper<
>>>>>
>>>>> http://cancerres.aacrjournals.__**org/content/31/9/1173.full.__pdf<
>>>>>
>>>>>
>>>>> http://cancerres.aacrjournals.__org/content/31/9/1173.full.pdf
>>>>>
>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>__>
>>>>>
>>>>>
>>>>> **>
>>>>>
>>>>> [[alternative HTML
>>>>> version deleted]]
>>>>>
>>>>>
>>>>> ________________________________**_________________
>>>>> Bioc-devel at r-project.org
>>>>> <mailto:Bioc-devel at r-project.org>
>>>>> mailing list
>>>>>
>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>
>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>
>>>>>
>>>>>
>>>>> [[alternative HTML version
>>>>> deleted]]
>>>>>
>>>>>
>>>>> ________________________________**_________________
>>>>> Bioc-devel at r-project.org
>>>>> <mailto:Bioc-devel at r-project.org>
>>>>> mailing list
>>>>>
>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>
>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Computational Biology / Fred Hutchinson
>>>>> Cancer Research Center
>>>>> 1100 Fairview Ave. N.
>>>>> PO Box 19024 Seattle, WA 98109
>>>>>
>>>>> Location: Arnold Building M1 B861
>>>>> Phone: (206) 667-2793
>>>>> <tel:%28206%29%20667-2793>
>>>>>
>>>>>
>>>>> ________________________________**_________________
>>>>> Bioc-devel at r-project.org
>>>>> <mailto:Bioc-devel at r-project.org> mailing
>>>>> list
>>>>>
>>>>> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>>>>>
>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *A model is a lie that helps you see the truth.*
>>>>> *
>>>>> *
>>>>> Howard Skipper<
>>>>>
>>>>>
>>>>> http://cancerres.aacrjournals.__org/content/31/9/1173.full.pdf
>>>>>
>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>__>
>>>>>
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> _________________________________________________
>>>>> Bioc-devel at r-project.org
>>>>> <mailto:Bioc-devel at r-project.org> mailing list
>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>
>>>>>
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> _________________________________________________
>>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>>> mailing list
>>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>>>
>>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N.
>>>>> PO Box 19024 Seattle, WA 98109
>>>>>
>>>>> Location: Arnold Building M1 B861
>>>>> Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> /A model is a lie that helps you see the truth./
>>>>> /
>>>>> /
>>>>> Howard Skipper
>>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>>>>
>>>>
>>>>
>>>> --
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>
>>>
>>>
>>>
>>>
>>> --
>>> A model is a lie that helps you see the truth.
>>>
>>> Howard Skipper
>>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
More information about the Bioc-devel
mailing list