[Bioc-devel] SummarizedExperiments
Martin Morgan
mtmorgan at fhcrc.org
Wed Sep 12 20:23:57 CEST 2012
On 09/06/2012 02:26 PM, Kasper Daniel Hansen wrote:
> Altogether, this looks great. I am still implementing/testing, but
> it seems pretty nice.
>
> Here are a number of convenience methods I believe should be
> implemented. Some of them come from eSet and some of them come from
> my own work in bsseq, but I have used all of them extensively in
> various packages.
>
> sampleNames, featureNames, pData
> granges, start, end, width, strand, seqnames, seqlengths, seqlevels
> findOverlaps, subsetByOverlaps
GenomicRanges 1.9.65 in Bioc devel makes the operations listed below
work on the rowData underlying a SummarizedExperiment. I particularly
like subsetByOverlaps() for selecting, e.g., SummarizedExperiment rows
that overlap features of interest, and
seqlevels(x, force=TRUE) <- paste0("chr", 1:5), for instance, for
selecting sequences of interest. See the section 'GRanges compatibility
(rowData access)' on ?SummarizedExperiment.
Martin
compare, countOverlaps, coverage, disjointBins, distance,
distanceToNearest, duplicated, end, end<-, findOverlaps, flank, follow,
granges, isDisjoint, match, mcols, mcols<-, narrow, nearest, order,
precede, ranges, ranges<-, rank, resize, restrict, seqinfo, seqinfo<-,
seqnames, shift, sort, start, start<-, strand, strand<-, width, width<-.
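
A minimal sketch of what subsetByOverlaps() selects, shown on a plain
GRanges rather than on a SummarizedExperiment so that it does not depend
on a particular GenomicRanges version. The objects 'rows' and 'roi' are
made up for illustration; per the note above, the same call on a
SummarizedExperiment would act on its rowData.

```r
# Illustrative only: 'rows' and 'roi' are hypothetical objects.
# The block is skipped when GenomicRanges is not installed.
if (requireNamespace("GenomicRanges", quietly = TRUE)) {
  suppressPackageStartupMessages(library(GenomicRanges))

  # Ten single-base ranges on chr1, standing in for the rows of an experiment.
  rows <- GRanges(seqnames = "chr1",
                  ranges = IRanges(start = 1:10, width = 1))

  # A region of interest covering positions 3 to 5.
  roi <- GRanges(seqnames = "chr1",
                 ranges = IRanges(start = 3, end = 5))

  # Keep only the rows that overlap the region of interest.
  hits <- subsetByOverlaps(rows, roi)
  print(start(hits))
}
```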
>
> 'granges' is a method already defined in GenomicRanges, I have found
> it very convenient. It is not clear to me why it is all lowercase
>
> What I have done in bsseq for the latter two lines of methods, is to
> have a class 'hasGRanges' that designates any object with a GRanges.
> Then implement the granges() method and the rest goes from there. Not
> sure whether that makes sense in this context, but in general I have
> found it very convenient for my own work. Code for all of this is in
> bsseq/R/hasGRanges.R. I have two classes in the package inheriting
> from hasGRanges. Note that the classes in bsseq have not been updated
> to build on the new SummarizedExperiment (yet).
>
> Kasper
>
>
> On Wed, Sep 5, 2012 at 1:31 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
>> It seems to work with a Matrix in the assays:
>>
>> R> packageVersion('GenomicRanges')
>> [1] ‘1.9.61’
>>
>> R> SE <- SummarizedExperiment(assays=wts, rowData=wts.locs)
>> Error in function (classes, fdef, mtable) :
>> unable to find an inherited method for function "SummarizedExperiment",
>> for signature "dgCMatrix"
>>
>> R> SE <- SummarizedExperiment(assays=SimpleList(weights=wts),
>> rowData=wts.locs)
>> dimnames(.) <- NULL: translated to
>> dimnames(.) <- list(NULL,NULL) <==> unname(.)
>>
>> R> show(SE)
>> class: SummarizedExperiment
>> dim: 35855 35855
>> exptData(0):
>> assays(1): weights
>> rownames(35855): feat1 feat2 ... feat35854 feat35855
>> rowData metadata column names(10): egid signed ... DHS_DGF chromHMM_state
>> colnames(35855): feat1 feat2 ... feat35854 feat35855
>> colData names(0):
>>
>> R> class(assays(SE, 'weights'))
>> [1] "SimpleList"
>> attr(,"package")
>> [1] "IRanges"
>>
>> R> class(assays(SE)$weights)
>> [1] "dgCMatrix"
>> attr(,"package")
>> [1] "Matrix"
>>
>>
>> Very cool! Albeit a tad disorienting. I was going to patch this up and
>> then realized, hey, if someone's going to stuff a Matrix into their assays,
>> they ought to give it a name and know what's going on under the hood, at
>> least for now. It seems to work correctly, too.
>>
>> Oh oh, I spoke too soon. If I split by chromosome, the assay falls out:
>>
>> R> head(assays(SE)$weights)
>> 6 x 35855 sparse Matrix of class "dgCMatrix"
>> [[ suppressing 82 column names ‘feat1’, ‘feat2’, ‘feat3’ ... ]]
>>
>> feat1 1 . . . . . . . . . . . . . . . . .
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> . . . . . . . . . . . .
>> feat2 . 1 . . . . . . . . . . . . . . . .
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> . . . . . . . . . . . .
>> feat3 . . 1.0000000 0.5737488 . . . . . . . . . . . . . .
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> . . . . . . . . . . . .
>> feat4 . . 0.5737488 1.0000000 . . . . . . . . . . . . . .
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> . . . . . . . . . . . .
>> feat5 . . . . 1.00000000 0.04300373 . . . . . . . . . . . .
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> . . . . . . . . . . . .
>> feat6 . . . . 0.04300373 1.00000000 . . . . . . . . . . . .
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> . . . . . . . . . . . .
>>
>> feat1 . . . . . . . . . . . . . . ......
>> feat2 . . . . . . . . . . . . . . ......
>> feat3 . . . . . . . . . . . . . . ......
>> feat4 . . . . . . . . . . . . . . ......
>> feat5 . . . . . . . . . . . . . . ......
>> feat6 . . . . . . . . . . . . . . ......
>>
>> .....suppressing columns in show(); maybe adjust 'options(max.print= *)'
>> ..............................
>> R> SE.by.chr <- split(SE, as.vector(seqnames(SE)))
>> R> head(assays(SE.by.chr)$weights)
>> Error in head(assays(SE.by.chr)$weights) :
>> error in evaluating the argument 'x' in selecting a method for function
>> 'head': Error in function (classes, fdef, mtable) :
>> unable to find an inherited method for function "assays", for signature
>> "list"
>>
>> Ideas? I would love to do something useful here instead of just fussing.
>> It's really a great structure.
>>
>> I love the use of a reference class to reference the internal pieces of an
>> object, it's such a clever hack. Thanks much for adding it.
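
A minimal sketch of the general idea, in base R only. The class names
'AssayStore' and 'ToyExperiment' are hypothetical and this is not the
GenomicRanges implementation: it only illustrates why keeping the large
matrices behind a reference class means that modifying another slot does
not duplicate them.

```r
library(methods)

# Hypothetical reference class holding the (potentially large) assay data.
AssayStore <- setRefClass("AssayStore", fields = list(data = "list"))

# Hypothetical S4 container: assays behind a reference, colData by value.
setClass("ToyExperiment",
         representation(assays = "AssayStore", colData = "data.frame"))

se <- new("ToyExperiment",
          assays = AssayStore$new(data = list(rpm = matrix(0.2, 1000, 50))),
          colData = data.frame(sampleID = paste0("person", 1:50)))

# Ordinary copy-on-modify applies to colData ...
se2 <- se
se2@colData$treatment <- rep(c("a", "b"), 25)

# ... but both objects still point at the same AssayStore, so a change
# made through one handle is visible through the other.
store <- se@assays
store$data$rpm[1, 1] <- 1
print(se2@assays$data$rpm[1, 1])
print(names(se@colData))
print(names(se2@colData))
```

The flip side, raised elsewhere in this thread, is that two variables
pointing at the same instance are updated together, so any operation that
changes the assays has to copy the store explicitly.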
>>
>> --t
>>
>>
>>
>> On Tue, Sep 4, 2012 at 7:58 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>>
>>> On 09/04/2012 07:02 PM, Tim Triche, Jr. wrote:
>>>>
>>>> This is really cool.
>>>>
>>>> R> head(lsos(), 1)
>>>> Type Size Rows Columns
>>>> LAML.RNAseq SummarizedExperiment 38666528 23529 173
>>>> R> LAML.RNAseq.updated <- updateObject(LAML.RNAseq)
>>>> R> head(lsos(), 2)
>>>> Type Size Rows Columns
>>>> LAML.RNAseq SummarizedExperiment 38666528 23529 173
>>>> LAML.RNAseq.updated SummarizedExperiment 6101488 23529 173
>>>
>>>
>>> I think either the sizes are misleading (they miss the content of the
>>> environment that the reference classes contain; probably
>>> LAML.RNAseq.updated@assays$data has size 38666528 - 6101488) or the data
>>> has not been updated correctly :\. Nonetheless, manipulation of especially
>>> non-assay but also assay data should be faster.
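
A small base R sketch of why the reported sizes can mislead: object.size()
does not follow an environment to measure what is stored inside it. The
objects 'big' and 'store' are hypothetical stand-ins, not part of any
package.

```r
# Hypothetical toy objects, base R only.
big <- matrix(0.2, nrow = 1000, ncol = 100)

# Put the same matrix inside an environment.
store <- new.env()
store$rpm <- big

size_matrix <- object.size(big)    # counts the 100,000 doubles
size_env    <- object.size(store)  # counts only the environment itself

print(size_matrix, units = "Kb")
print(size_env)

# The data are still there, just not counted in size_env.
print(dim(store$rpm))
```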
>>>
>>> I added a brief section (1.9.60) to ?SummarizedExperiment describing what
>>> was done. Also, I tried (w/out actually testing...) to add support for
>>> Matrix instances.
>>>
>>> As a miscellaneous 'tip', when working with large data I find it useful to
>>> start R with --min-vsize=2048M --min-nsize=20M. These are documented in
>>> RShowDoc("R-intro") as 'for expert use only' and influence how much memory R
>>> allocates for vectors ('vsize'; vsize can be very large) and for
>>> S-expressions ('nsize' 50000000) and help get up to large memory allocations
>>> without innumerable garbage collections. Of course having big memory is of
>>> primary importance.
>>>
>>> Martin
>>>
>>>> R> str(LAML.RNAseq.updated@assays)
>>>> Reference class 'ShallowSimpleListAssays' [package "GenomicRanges"] with
>>>> 1 fields
>>>> $ data:Formal class 'SimpleList' [package "IRanges"] with 4 slots
>>>> .. ..@ listData :List of 1
>>>> .. .. ..$ rpm: num [1:23529, 1:173] 0 0 8.12 10.84 27.73 ...
>>>> .. ..@ elementType : chr "ANY"
>>>> .. ..@ elementMetadata: NULL
>>>> .. ..@ metadata : list()
>>>> and 11 methods,
>>>> R> packageVersion('GenomicRanges')
>>>> [1] ‘1.9.59’
>>>>
>>>> A slightly bigger one:
>>>>
>>>> R> head(lsos(),2)
>>>> Type Size Rows Columns
>>>> LAML SummarizedExperiment 1950606472 485577 192
>>>> LAML.updated SummarizedExperiment 458912760 485577 192
>>>> R> system.time(LAML.updated$foo <- rep('bar', 192))
>>>> user system elapsed
>>>> 0.132 0.116 0.248
>>>> R> system.time(LAML$foo <- rep('bar', 192))
>>>> user system elapsed
>>>> 2.728 2.144 7.519
>>>>
>>>>
>>>> That is a really clever hack you added. I think I will have to figure
>>>> out how it works :-)
>>>>
>>>> Thanks!!!
>>>>
>>>>
>>>> On Sun, Sep 2, 2012 at 2:47 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>>>
>>>> On 08/30/2012 08:47 AM, Vincent Carey wrote:
>>>>
>>>> I am all in favor of the dialogue and the aim. I reproduced your first
>>>> timing, and note
>>>>
>>>> unix.time(ss2@colData$tx <- 1:1000)
>>>>
>>>> user system elapsed
>>>> 5.963 4.520 10.483
>>>>
>>>> unix.time(colData(ss2)$tx2 <- 1:1000)
>>>>
>>>> user system elapsed
>>>> 17.937 13.074 31.016
>>>>
>>>> BTW did you really mean to put the meth/unmeth in exptData, or should it
>>>> be in assays?
>>>>
>>>> From an experimentation perspective, I would want to redo these
>>>> computations above with an environment holding the assay data and see how
>>>> it changes. I suppose one can infer from first principles that
>>>>
>>>> ss4@assays[[1]]$meth[1:4,1:4]
>>>>
>>>> [,1] [,2] [,3] [,4]
>>>> [1,] 0.2 0.2 0.2 0.2
>>>> [2,] 0.2 0.2 0.2 0.2
>>>> [3,] 0.2 0.2 0.2 0.2
>>>> [4,] 0.2 0.2 0.2 0.2
>>>>
>>>> ss4@assays[[2]]$unmeth[1:4,1:4]
>>>>
>>>>
>>>> [,1] [,2] [,3] [,4]
>>>> [1,] 0.2 0.2 0.2 0.2
>>>> [2,] 0.2 0.2 0.2 0.2
>>>> [3,] 0.2 0.2 0.2 0.2
>>>> [4,] 0.2 0.2 0.2 0.2
>>>>
>>>> unix.time(colData(ss4)$tx2 <- 1:1000)
>>>>
>>>> user system elapsed
>>>> 0.010 0.002 0.012
>>>>
>>>> where ss4 has a SimpleList of environments holding the assay data.
>>>>
>>>> ss4
>>>>
>>>> class: SummarizedExperiment
>>>> dim: 450000 1000
>>>> exptData(0):
>>>> assays(2): '' ''
>>>> rownames: NULL
>>>> rowData values names(0):
>>>> colnames: NULL
>>>> colData names(4): sampleID treatment tx tx2
>>>>
>>>> ss4@assays[[1]]
>>>>
>>>> <environment: 0x1f4e0ba0>
>>>>
>>>> This can't be attempted without disabling validity checking, and the stack
>>>> trace from the attempt is instructive.
>>>>
>>>> I don't think it will be too difficult to establish a customized
>>>> SummarizedExperiment that permits this, using the lockedEnvironment
>>>> approach of ExpressionSets. I also do not know the down sides but it seems
>>>> you/we could get some mileage on the approach and see.
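
A rough base R sketch of the locked-environment idea mentioned above. The
helpers 'make_store' and 'update_store' are hypothetical names invented
for this illustration and are not the Biobase or GenomicRanges code: the
environment is locked so that it cannot be changed in place, and an
update builds a new locked environment instead.

```r
# Hypothetical helper: put named objects in an environment and lock it,
# so neither the bindings nor their values can be changed in place.
make_store <- function(...) {
  env <- list2env(list(...), envir = new.env())
  lockEnvironment(env, bindings = TRUE)
  env
}

# Hypothetical helper: an update copies the contents into a new locked
# environment, leaving every other holder of the old one unaffected.
update_store <- function(store, name, value) {
  contents <- as.list(store, all.names = TRUE)
  contents[[name]] <- value
  do.call(make_store, contents)
}

assays <- make_store(meth = matrix(0.2, 4, 4))

# In-place modification is refused because the binding is locked.
attempt <- tryCatch({
  assign("meth", NULL, envir = assays)
  "modified"
}, error = function(e) "locked")
print(attempt)

# Adding an assay goes through an explicit copy.
assays2 <- update_store(assays, "unmeth", matrix(0.8, 4, 4))
print(sort(ls(assays2)))
print(ls(assays))
```

Under this scheme, passing the container around is cheap because only the
environment handle travels with it, while anything that changes the
assays pays for a copy.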
>>>>
>>>>
>>>> GenomicRanges v. 1.9.59 has some preliminary changes that address
>>>> the performance issue. Serialized (saved) objects 'x' need to be
>>>> updated with updateObject(x) (and be careful if these are your only
>>>> instances of some hard work). There are some loose ends to tidy up.
>>>>
>>>> Martin
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Aug 30, 2012 at 10:53 AM, Kasper Daniel Hansen
>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>
>>>> My perspective (and the reason for my email) is pretty simple.
>>>>
>>>> First off, let me say that we will always be working on trading off
>>>> memory and IO and development time and all the other stuff. I mean,
>>>> clearly the environment stuff for ExpressionSets did not help all
>>>> people, because we have xps, aroma.affymetrix and oligo all using some
>>>> kind of file based backend.
>>>>
>>>> I am not trying to solve something "big". Putting methylation data
>>>> inside an environment is not suddenly going to make my algorithms go
>>>> _significantly_ faster or the memory usage to go _significantly_ down.
>>>> But it makes life faster when you explore data - I believe.
>>>>
>>>> Now for two realistic examples (one minfi, one bsseq) that I encounter
>>>> all the time:
>>>> library(GenomicRanges)
>>>> nRows <- 450000
>>>> nCols <- 1000
>>>> SSet <- SummarizedExperiment(
>>>>     rowData = GRanges(seqnames = rep("chr1", nRows),
>>>>                       ranges = IRanges(1:nRows, width = 1)),
>>>>     colData = DataFrame(sampleID = paste0("person", 1:nCols)),
>>>>     exptData = SimpleList(
>>>>         meth = matrix(0.2, ncol = nCols, nrow = nRows),
>>>>         unmeth = matrix(0.2, ncol = nCols, nrow = nRows)))
>>>> print(object.size(SSet), units = "auto") # 6.7GB
>>>> system.time({
>>>>     colData(SSet)$treatment = rep(c("a", "b"), nCols / 2)
>>>> }) # around 26 sec
>>>>
>>>>
>>>> nRows <- 28000000
>>>> nCols <- 10
>>>> SSet <- SummarizedExperiment(
>>>>     rowData = GRanges(seqnames = rep("chr1", nRows),
>>>>                       ranges = IRanges(1:nRows, width = 1)),
>>>>     colData = DataFrame(sampleID = paste0("person", 1:nCols)),
>>>>     exptData = SimpleList(
>>>>         meth = matrix(0.2, ncol = nCols, nrow = nRows),
>>>>         unmeth = matrix(0.2, ncol = nCols, nrow = nRows)))
>>>> print(object.size(SSet), units = "auto") # 4.4GB
>>>> system.time({
>>>>     colData(SSet)$treatment = rep(c("a", "b"), nCols / 2)
>>>> }) # around 20 sec
>>>>
>>>> This is of course not that slow, especially not when compared to the
>>>> massive amount of computation I had to do in order to arrive at the
>>>> SSet in the first place. But the difference is that I am staring at the
>>>> screen when I work interactively, but I am out partying when I run my
>>>> computations (overnight). And I do stuff like the above surprisingly
>>>> many times after I have generated the objects.
>>>>
>>>> My perspective is that this was essentially "solved" with the
>>>> ExpressionSet class, and I feel we are regressing a bit. I also
>>>> believe that it should be "easy" to re-use some of the ideas/code from
>>>> ExpressionSet and that it is worthwhile to do so, for something that
>>>> may very well be an extremely important core class.
>>>> SummarizedExperiment is such a nice class that I essentially have a
>>>> parallel implementation in bsseq and Pete has one in genoset.
>>>>
>>>> So essentially, I was interested in fixing a pretty small problem, but
>>>> something that I am a bit irritated about. And I feel we almost have
>>>> a ready to use solution.
>>>>
>>>> The reason I ask about it here, and not just do it myself, is twofold:
>>>> (1) it is tricky to get the environment right, and I do not have
>>>> enough experience on the environment part of ExpressionSet that I
>>>> feel comfortable doing it;
>>>> (2) I really think it would be beneficial to the community, and
>>>> experience tells us that what happens in these core infrastructure
>>>> packages is much more likely to be widely used.
>>>>
>>>> While I think that it may be worthwhile for people to play with
>>>> reference classes, I think that going down that route is unexplored
>>>> and may or may not be worth it. There is something to be said for
>>>> building on something we have experience with. A reference class idea
>>>> can be developed in parallel, if people are willing to commit the
>>>> time.
>>>>
>>>> Kasper
>>>>
>>>>
>>>> On Thu, Aug 30, 2012 at 10:12 AM, Tim Triche, Jr.
>>>> <tim.triche at gmail.com> wrote:
>>>>
>>>> I am interested in how best to benchmark such operations (and another
>>>> thought that I recently had: "why can't I put a Doug Bates style 'Matrix'
>>>> in the SimpleList of assay data?"). Kasper is the person who convinced me
>>>> that ff, rhdf5, etc. should be a last resort, since RAM is cheap.
>>>>
>>>> For large data, the TCGA BRCA samples are a hair under 800 (last I looked)
>>>> 450k arrays with matching RNAseq and DNAseq, and that's not even a big
>>>> study by breast cancer standards (IIRC, Gordon Mills was rounding up 5000
>>>> tumors). Both minfi and methylumi can work off of the raw data in these
>>>> studies as a "practical" example, but when it really gets interesting is
>>>> when 3-4 ragged assays exist for each patient.
>>>>
>>>> Exon-wise (or splice-graph-wise) RNAseq data with splice junction read
>>>> counts is another place to find big SE-centric data with ambiguities. One
>>>> of the reasons I keep coming back to the idea of subsetting an SE by a
>>>> GRanges[List], e.g. SE.totalRNA[ UCSC.lincRNAs, ], is its speed. In any
>>>> event, after my obligations for the day are dispatched, I'll write up an
>>>> example of doing this off of TCGA data, perhaps from colorectal (only
>>>> about 400 subjects but at least it is published already). You are right,
>>>> a single use case is often worth 1000 specifications.
>>>>
>>>> --t
>>>>
>>>>
>>>>
>>>> On Thu, Aug 30, 2012 at 6:42 AM, Vincent Carey
>>>> <stvjc at channing.harvard.edu> wrote:
>>>>
>>>> On Thu, Aug 30, 2012 at 9:16 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>>>>
>>>> On Thu, Aug 30, 2012 at 8:59 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>>>
>>>> On 08/30/2012 04:42 AM, Vincent Carey wrote:
>>>>
>>>> On Thu, Aug 30, 2012 at 6:27 AM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
>>>>
>>>>
>>>> nb. one of the reasons for the existence of the MergedDataSet class in
>>>> regulatoR (to be submitted for review shortly) is that, while SEs are
>>>> absolutely fantastic for managing data matrices that are stapled to a
>>>> GRanges, what is less awesome is having a relatively light-weight
>>>> DataFrame for phenotypic data that requires the entire memory footprint
>>>> be recreated upon writing a new column into said DataFrame.
>>>>
>>>> If R5 classes didn't spook me a little, I would already have done
>>>> something
>>>>
>>>> We don't often use this R5 terminology but I see Hadley has made an
>>>> accessible document referring to reference classes in this way.
>>>>
>>>> To me the challenge is more conceptual -- pass-by-reference and the way
>>>> that two variables pointing to the instance are updated at the same
>>>> time -- and I had been thinking of a LockedEnvironment-style
>>>> implementation where some operations were free ('copying') but others
>>>> weren't (subset, subset assign). But maybe there are some more direct
>>>> approaches...
>>>>
>>>> My 2c: This is a situation where some experimental data would be
>>>> helpful.
>>>>
>>>> Perhaps Tim or Kasper can share a largish (unpublished) dataset and a
>>>> typical workflow with the Seattle folks. Even that level of detail would
>>>> give some sense of the scope of the problem.
>>>>
>>>> For clarification, I did not mean large data, although that would be
>>>> welcome. I meant data on computational experiments with different
>>>> approaches -- that is, performance statistics that would indicate where
>>>> scalability fails, what steps have been taken to recover it, and how
>>>> costly those steps are in terms of complexity, reliability,
>>>> maintainability, risk at the user/application end.
>>>>
>>>> Sean
>>>>
>>>> Yes, for instance where in the interactive use is time being spent? Is
>>>> it copying the assays, or validity, or actually updating the row data?
>>>> Is 500000 x 800 an appropriate scale to be thinking about?
>>>>
>>>> The main avenues for a developer seem to be a) use environments or
>>>> reference classes; there are some costs and we should understand them,
>>>> and b) use an out-of-memory approach like rhdf5 or ff. Again there will
>>>> be some costs. It should be relatively easy to experiment with these.
>>>> One thing I just learned about is setValidity2 and disableValidity
>>>> (defined in IRanges IIRC) ... these allow you to construct certain
>>>> variations on SummarizedExperiment with less attention to deeper
>>>> infrastructure.
>>>>
>>>> probably I can make better use of the insights the IRanges guys have had
>>>> in their careful development and application of validity methods, though
>>>> I feel a bit like these are 'attractive hazards' that tempt us to do
>>>> unsafe things and then pay the price later. This is likely the first
>>>> direction I'll explore.
>>>>
>>>> Exploring a little I already see that there are some pretty dumb things
>>>> being done in assignment.
>>>>
>>>> Martin
>>>>
>>>> whereby the assays/libraries for a given study subject are all pointed
>>>> to as SEs (i.e. RNAseq, BSseq, expression/methylation arrays, CNV/SNP
>>>> arrays, WGS or exomic DNAseq) and the column (phenotype) data can avoid
>>>> being subject to these constraints. Truth be told I *still* want to do
>>>> that because, most of the time, updates to the latter are independent
>>>> of, and happen subsequently to, loading the former.
>>>>
>>>> Suggestions would be welcome, because other than these minor niggles,
>>>> the SummarizedExperiment class is almost perfect for many tasks.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Aug 29, 2012 at 9:57 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
>>>>
>>>> assigning new colData columns, or overwriting old ones, in a sizable (say
>>>> 500000 row x 800 column) SE is nauseatingly slow.
>>>>
>>>> There has to be a better way -- I'm willing to write it if someone can
>>>> point out an obvious way to do it
>>>>
>>>> On Wed, Aug 29, 2012 at 9:52 PM, Kasper Daniel Hansen
>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>
>>>> On Thu, Aug 30, 2012 at 12:44 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>>>
>>>> On 08/29/2012 06:46 PM, Kasper Daniel Hansen wrote:
>>>>
>>>> There is a lot of good stuff to say about SummarizedExperiments, and
>>>> from a certain point of view I have a parallel implementation in bsseq
>>>> (and there is also one in genoset).
>>>>
>>>> However, I really like having the assayData inside an environment.
>>>> This helps some on memory and - equally important - speed at the
>>>> command line. I certainly need to very heavily consider using an
>>>> environment in bsseq.
>>>>
>>>> After some discussion with Tim (Triche) we have agreed that something
>>>> like SummarizedExperiments is the way to go at least for the
>>>> methylation arrays. We need to be able to easily handle 1000s of
>>>> samples.
>>>>
>>>> What is the chance that we can get the option of having the assayData
>>>> inside an environment, perhaps by
>>>>
>>>>    Making a class that is an environment and inherits from SimpleList.
>>>>
>>>>    Using a classUnion between the existing class of the assayData and
>>>>    an environment.
>>>>
>>>>    Third option that is probably better than the preceding two, but
>>>>    which I cannot come up with right now.
>>>>
>>>> Probably something can / will be done. I guess the slowness you're
>>>> talking about is when rowData / colData columns are manipulated; any
>>>> kind of subsetting would mean a 'deep' copy. Martin
>>>>
>>>> Yes, for example manipulating colData - something that conceptually
>>>> should be quick and easy. Of course, this will not affect any real
>>>> computation on the assayData matrices, but it will make life at the
>>>> command prompt more pleasant.
>>>>
>>>> Kasper
>>>>
>>>> This would - in my opinion - be very nice and worthwhile.
>>>>
>>>> Kasper
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>> --
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> *A model is a lie that helps you see the truth.*
>>>>
>>>> Howard Skipper
>>>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>
>>
>>
>>
>> --
>> A model is a lie that helps you see the truth.
>>
>> Howard Skipper
>>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793