[Bioc-devel] SummarizedExperiments
Martin Morgan
mtmorgan at fhcrc.org
Wed Sep 5 04:58:26 CEST 2012
On 09/04/2012 07:02 PM, Tim Triche, Jr. wrote:
> This is really cool.
>
> R> head(lsos(), 1)
> Type Size Rows Columns
> LAML.RNAseq SummarizedExperiment 38666528 23529 173
> R> LAML.RNAseq.updated <- updateObject(LAML.RNAseq)
> R> head(lsos(), 2)
> Type Size Rows Columns
> LAML.RNAseq SummarizedExperiment 38666528 23529 173
> LAML.RNAseq.updated SummarizedExperiment 6101488 23529 173
I think either the sizes are misleading (missing the content of the
environment that the reference classes contain, probably
LAML.RNAseq.updated at assays$data has size 38666528 - 6101488) or the data
has not been updated correctly :\. Nonetheless manipulation of
especially non-assay but also assay data should be faster.
I added a brief section (1.9.60) to ?SummarizedExperiment describing
what was done. Also, I tried (w/out actually testing...) to add support
for Matrix instances.
As a miscellaneous 'tip', when working with large data I find it useful
to start R with --min-vsize=2048M --min-nsize=20M. These are documented
in RShowDoc("R-intro") as 'for expert use only' and influence how much
memory R allocates for vectors ('vsize'; vsize can be very large) and
for S-expressions ('nsize' 50000000) and help get up to large memory
allocations without innumerable garbage collections. Of course having
big memory is of primary importance.
Martin
> R> str(LAML.RNAseq.updated at assays)
> Reference class 'ShallowSimpleListAssays' [package "GenomicRanges"] with
> 1 fields
> $ data:Formal class 'SimpleList' [package "IRanges"] with 4 slots
> .. ..@ listData :List of 1
> .. .. ..$ rpm: num [1:23529, 1:173] 0 0 8.12 10.84 27.73 ...
> .. ..@ elementType : chr "ANY"
> .. ..@ elementMetadata: NULL
> .. ..@ metadata : list()
> and 11 methods,R>
> R> packageVersion('GenomicRanges')
> [1] ‘1.9.59’
>
> A slightly bigger one:
>
> R> head(lsos(),2)
> Type Size Rows Columns
> LAML SummarizedExperiment 1950606472 485577 192
> LAML.updated SummarizedExperiment 458912760 485577 192
> R> system.time(LAML.updated$foo <- rep('bar', 192))
> user system elapsed
> 0.132 0.116 0.248
> R> system.time(LAML$foo <- rep('bar', 192))
> user system elapsed
> 2.728 2.144 7.519
>
>
> That is a really clever hack you added. I think I will have to figure
> out how it works :-)
>
> Thanks!!!
>
>
> On Sun, Sep 2, 2012 at 2:47 PM, Martin Morgan <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
>
> On 08/30/2012 08:47 AM, Vincent Carey wrote:
>
> I am all in favor of the dialogue and the aim. I reproduced
> your first
> timing, and note
>
> unix.time(ss2 at colData$tx <- 1:1000)
>
> user system elapsed
> 5.963 4.520 10.483
>
> unix.time(colData(ss2)$tx2 <- 1:1000)
>
> user system elapsed
> 17.937 13.074 31.016
>
> BTW did you really mean to put the meth/unmeth in exptData, or
> should it be
> in assays?
>
> From an experimentation perspective, I would want to redo these
>
> computations above with an environment holding the assay data
> and see how
> it changes. I suppose one can infer from first principles that
>
> ss4 at assays[[1]]$meth[1:4,1:4]
>
> [,1] [,2] [,3] [,4]
> [1,] 0.2 0.2 0.2 0.2
> [2,] 0.2 0.2 0.2 0.2
> [3,] 0.2 0.2 0.2 0.2
> [4,] 0.2 0.2 0.2 0.2
>
> ss4 at assays[[2]]$unmeth[1:4,1:__4]
>
> [,1] [,2] [,3] [,4]
> [1,] 0.2 0.2 0.2 0.2
> [2,] 0.2 0.2 0.2 0.2
> [3,] 0.2 0.2 0.2 0.2
> [4,] 0.2 0.2 0.2 0.2
>
> unix.time(colData(ss4)$tx2 <- 1:1000)
>
> user system elapsed
> 0.010 0.002 0.012
>
> where ss4 has a SimpleList of environments holding the assay data.
>
> ss4
>
> class: SummarizedExperiment
> dim: 450000 1000
> exptData(0):
> assays(2): '' ''
> rownames: NULL
> rowData values names(0):
> colnames: NULL
> colData names(4): sampleID treatment tx tx2
>
> ss4 at assays[[1]]
>
> <environment: 0x1f4e0ba0>
>
> This can't be attempted without disabling validity checking, and
> the stack
> trace from the attempt is instructive.
>
> I don't think it will be too difficult to establish a customized
> SummarizedExperiment that permits this, using the lockedEnvironment
> approach of ExpressionSets. I also do not know the down sides
> but it seems
> you/we could get some mileage on the approach and see.
>
>
> GenomicRanges v. 1.9.59 has some preliminary changes that address
> the performance issue. Serialized (saved) objects 'x' need to be
> updated with updateObject(x) (and be careful if these are your only
> instances of some hard work). There are some loose ends to tidy up.
>
> Martin
>
>
>
>
> On Thu, Aug 30, 2012 at 10:53 AM, Kasper Daniel Hansen <
> kasperdanielhansen at gmail.com
> <mailto:kasperdanielhansen at gmail.com>> wrote:
>
> My perspective (and the reason for me email) is pretty simple.
>
> First off, let me say that we will always be working on
> trading off
> memory and IO and development time and all the other stuff.
> I mean,
> clearly the environment stuff for ExpressionSets did not
> help all
> people, because we have xps, aroma.affymetrix and oligo all
> using some
> kind of file based backend.
>
> I am not trying to solve something "big". Putting
> methylation data
> inside an environment is not suddenly going to make my
> algorithms go
> _significantly_ faster or the memory usage to go
> _significantly_ down.
> But it makes life faster when you explore data - I believe.
>
> Now for two realistic examples (one minfi, one bsseq) that I
> encounter
> all the time:
> library(GenomicRanges)
> nRows <- 450000
> nCols <- 1000
> SSet <- SummarizedExperiment(rowData = GRanges(seqnames =
> rep("chr1",
> nRows),
> ranges = IRanges(1:nRows,
> width = 1)),
> colData = DataFrame(sampleID =
> paste0("person", 1:nCols)),
> exptData = SimpleList(meth =
> matrix(0.2,
> ncol = nCols, nrow = nRows),
> unmeth = matrix(0.2, ncol =
> nCols, nrow =
> nRows)))
> print(object.size(SSet), units = "auto") # 6.7GB
> system.time({
> colData(SSet)$treatment = rep(c("a", "b"), nCols / 2)
> }) # around 26 sec
>
>
> nRows <- 28000000
> nCols <- 10
> SSet <- SummarizedExperiment(rowData = GRanges(seqnames =
> rep("chr1",
> nRows),
> ranges = IRanges(1:nRows,
> width = 1)),
> colData = DataFrame(sampleID =
> paste0("person", 1:nCols)),
> exptData = SimpleList(meth =
> matrix(0.2,
> ncol = nCols, nrow = nRows),
> unmeth = matrix(0.2, ncol =
> nCols, nrow =
> nRows)))
> print(object.size(SSet), units = "auto") # 4.4GB
> system.time({
> colData(SSet)$treatment = rep(c("a", "b"), nCols / 2)
> }) # around 20 sec
>
> This is of course not that slow, especially not when
> compared to the
> massive amount of computation I had to do in order to arrive
> at the
> SSet in the first place. But the difference is that I
> staring at the
> screen when I work interactively, but I am out partying when
> I run my
> computations (overnight). And I do stuff like the above
> surprisingly
> many times after I have generated the objects.
>
> My perspective is that this was essentially "solved" with the
> ExpressionSet class, and I feel we are regressing a bit. I also
> believe that it should be "easy" to re-use some of the
> ideas/code from
> ExpressionSet and that it is worthwhile to do so, for
> something that
> may very well be an extremely important core class.
> SummarizedExperiment is such a nice class that I essentially
> have a
> parallel implementation in bsseq and Pete has one in genoset.
>
> So essentially, I was interested in fixing a pretty small
> problem, but
> something that I am a bit irritated about. And I feel we
> almost have
> a ready to use solution.
>
> The reason I ask about it here, and not just do it myself,
> is twofold
> (1) it is tricky to get the environment right, and I do not have
> enough experience on the environment part of ExpressionSet,
> that I
> feel comfortable doing it
> (2) I really think it would be beneficial to the community, and
> experience tells us, that what happens in these core
> infrastructure
> packages are much more likely to be widely used.
>
> While I think that it may be worthwhile for people to play with
> reference classes, I think that going down that route is
> unexplored
> and may or may not be worth it. There is something to be
> said for
> building on something we have experience with. A reference
> class idea
> can be developed in parallel, if people are willing to
> commit the
> time.
>
> Kasper
>
>
> On Thu, Aug 30, 2012 at 10:12 AM, Tim Triche, Jr.
> <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
> wrote:
>
> I am interested in how best to benchmark such
> operations (and another
> thought that I recently had, "why can't I put a Doug
> Bates style 'Matrix'
> in the SimpleList of assay data)? Kasper is the person
> who convinced me
> that ff, rhdf5, etc.should be a last resort, since RAM
> is cheap.
>
> For large data, the TCGA BRCA samples are a hair under
> 800 (last I
>
> looked)
>
> 450k arrays with matching RNAseq and DNAseq, and that's
> not even a big
> study by breast cancer standards (IIRC, Gordon Mills was
> rounding up 5000
> tumors) Both minfi and methylumi can work off of the raw
> data in these
> studies as a "practical" example, but when it really
> gets interesting is
> when 3-4 ragged assays exist for each patient.
>
> Exon-wise (or splice-graph-wise) RNAseq data with splice
> junction read
> counts is another place to find big SE-centric data with
> ambiguities.
>
> One
>
> of the reasons I keep coming back to the idea of
> subsetting an SE by a
> GRanges[List], e.g. SE.totalRNA[ UCSC.lincRNAs, ], is
> its speed. In any
> event, after my obligations for the day are dispatched,
> I'll write up an
> example of doing this off of TCGA data, perhaps from
> colorectal (only
>
> about
>
> 400 subjects but at least it is published already). You
> are right, a
> single use case is often worth 1000 specifications.
>
> --t
>
>
>
> On Thu, Aug 30, 2012 at 6:42 AM, Vincent Carey
> <stvjc at channing.harvard.edu
> <mailto:stvjc at channing.harvard.edu>>__wrote:
>
>
>
> On Thu, Aug 30, 2012 at 9:16 AM, Sean Davis
> <sdavis2 at mail.nih.gov <mailto:sdavis2 at mail.nih.gov>>
>
> wrote:
>
>
>
>
> On Thu, Aug 30, 2012 at 8:59 AM, Martin Morgan
> <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
>
> wrote:
>
>
> On 08/30/2012 04:42 AM, Vincent Carey wrote:
>
> On Thu, Aug 30, 2012 at 6:27 AM, Tim
> Triche, Jr. <
>
> tim.triche at gmail.com <mailto:tim.triche at gmail.com>
>
> wrote:
>
>
> nb. one of the reasons for the
> existence of the MergedDataSet
>
> class in
>
> regulatoR (to be submitted for
> review shortly) is that, while SEs
>
> are
>
> absolutely fantastic for managing
> data matrices that are stapled to
>
> a
>
> GRanges, what is less awesome is
> having a relatively light-weight
> DataFrame
> for phenotypic data that requires
> the entire memory footprint be
> recreated
> upon writing a new column into said
> DataFrame.
>
> If R5 classes didn't spook me a
> little, I would already have done
> something
>
>
> We don't often use this R5 terminology
> but I see Hadley has made an
> accessible document referring to
> reference classes in this way.
>
>
> To me the challenge is more conceptual --
> pass-by-reference and the
>
> way
>
> that two variables pointing to the instance
> are updated at the same
>
> time --
>
> and I had been thinking of a
> LockedEnvironment-style implementation
>
> where
>
> some operations were free ('copying') but
> others weren't (subset,
>
> subset
>
> assign). But maybe there are some more
> direct approaches...
>
>
> My 2c: This is a situation where some
> experimental data would be
>
> helpful.
>
>
>
> Perhaps Tim or Kasper can share a largish
> (unpublished) dataset and a
> typical workflow with the Seattle folks. Even
> that level of detail
>
> would
>
> give some sense of the scope of the problem.
>
>
>
> For clarification, I did not mean large data,
> although that would be
> welcome. I meant data on computational experiments
> with different
> approaches -- that is, performance statistics that
> would indicate where
> scalability fails, what steps have been taken to
> recover it, and how
>
> costly
>
> those steps are in terms of complexity, reliability,
> maintainability,
>
> risk
>
> at the user/application end.
>
>
>
> Sean
>
>
> Yes, for instance where in the interactive
> use is time being spent? Is
> it copying the assays, or validity, or
> actually updating the row
>
> data? Is
>
> 500000 x 800 an appropriate scale to be
> thinking about?
>
>
> The main avenues for a developer seem
> to be a) use environments or
>
> reference classes; there are some costs
> and we should understand
>
> them,
>
> and
> b) use an out-of-memory approach like
> rhdf5 or ff. Again there will
>
> be
>
> some costs. It should be relatively
> easy to experiment with these.
>
> One
>
> thing I just learned about is
> setValidity2 and disableValidity
>
> (defined
>
> in
> IRanges IIRC) ... these allow you to
> construct certain variations on
> SummarizedExperiment with less attention
> to deeper infrastructure.
>
>
> probably I can make better use of the
> insights the IRanges guys have
>
> had
>
> in their careful development and application
> of validity methods,
>
> though I
>
> feel a bit like these are 'attractive
> hazards' that tempt us to do
>
> unsafe
>
> things and then pay the price later. This is
> likely the first
>
> direction
>
> I'll explore.
>
> Exploring a little I already see that there
> are some pretty dumb
>
> things
>
> being done in assignment.
>
> Martin
>
>
> whereby the assays/libraries for a given
> study subject are all
>
> pointed
>
> to
> as SEs (i.e. RNAseq, BSseq,
> expression/methylation arrays, CNV/SNP
> arrays,
> WGS or exomic DNAseq) and the column
> (phenotype) data can avoid
>
> being
>
> subject to these constraints. Truth
> be told I *still* want to do
>
> that
>
> because, most of the time, updates
> to the latter are independent of,
> and
> happen subsequently to, loading the
> former.
>
> Suggestions would be welcome,
> because other than these minor
>
> niggles,
>
> the
> SummarizedExperiment class is almost
> perfect for many tasks.
>
>
>
>
> On Wed, Aug 29, 2012 at 9:57 PM, Tim
> Triche, Jr. <
>
> tim.triche at gmail.com <mailto:tim.triche at gmail.com>
>
>
> wrote:
>
>
> assigning new colData columns, or
> overwriting old ones, in a
>
> sizable
>
> (say
> 500000 row x 800 column) SE is
> nauseatingly slow.
>
>
>
>
> There has to be a better way --
> I'm willing to write it if someone
>
> can
>
> point out an obvious way to do it
>
>
>
> On Wed, Aug 29, 2012 at 9:52 PM,
> Kasper Daniel Hansen <
> kasperdanielhansen at gmail.com
> <mailto:kasperdanielhansen at gmail.com>>
> wrote:
>
> On Thu, Aug 30, 2012 at 12:44
> AM, Martin Morgan <
>
> mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
>
> wrote:
>
> On 08/29/2012 06:46 PM,
> Kasper Daniel Hansen wrote:
>
>
> There is a lot of
> good stuff to say
> about
> SummarizedExperiments,
> and
> from a certain point
> of view I have a
> parallel
> implementation in
>
> bsseq
>
>
> (and there is also one in
> genoset).
>
>
> However, I really
> like having the
> assayData inside an
>
> environment.
>
> This helps some on
> memory and - equally
> important - speed at the
> command line. I
> certainly need to
> very heavily
> consider using
>
> an
>
> environment in bsseq.
>
> After some
> discussion with Tim
> (Triche) we have
> agreed that
> something
> like
> SummarizedExperiments is
> the way to go at
> least for the
> methylation arrays.
> We need to be able
> to easily handle 1000s
>
> of
>
> samples.
>
> What is the chance
> that we can get the
> option of having the
> assayData
> inside an
> environment, perhaps by
> Making a class
> that is an
> environment and
> inherits from
>
> SimpleList.
>
>
> Using a classUnion
> between the existing class of
> the assayData
>
> and
> an environment.
> Third option
> that is probably
> better than the
> proceeding
>
> two,
>
> but
> which I cannot come
> up with right now.
>
>
>
> Probably something can /
> will be done. I guess
> the slowness
>
> you're
>
>
> talking
>
> about is when rowData /
> colData columns are
> manipulated; any
>
> kind of
>
> subsetting would mean a
> 'deep' copy. Martin
>
>
> Yes, for example
> manipulating colData -
> something that
>
> conceptually
>
> should be quick and easy.
> Of course, this will not
> affect any
>
> real
>
> computation on the assayData
> matrices, but it will make
> life at
>
> the
>
> command prompt more pleasant.
>
> Kasper
>
>
>
>
> This would - in my
> opinion - be very
> nice and worthwhile.
>
> Kasper
>
> ________________________________**_________________
> Bioc-devel at r-project.org
> <mailto:Bioc-devel at r-project.org>
> mailing list
> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>
>
>
>
> --
> Computational Biology /
> Fred Hutchinson Cancer
> Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA
> 98109
>
> Location: Arnold
> Building M1 B861
> Phone: (206) 667-2793
> <tel:%28206%29%20667-2793>
>
>
> ________________________________**_________________
> Bioc-devel at r-project.org
> <mailto:Bioc-devel at r-project.org>
> mailing list
> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>
>
>
>
>
> --
> *A model is a lie that helps you
> see the truth.*
> *
> *
> Howard Skipper<
>
> http://cancerres.aacrjournals.__**org/content/31/9/1173.full.__pdf<
>
> http://cancerres.aacrjournals.__org/content/31/9/1173.full.pdf
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>__>
>
> **>
>
>
>
>
>
> --
> *A model is a lie that helps you see
> the truth.*
> *
> *
> Howard Skipper<
> http://cancerres.aacrjournals.__**org/content/31/9/1173.full.__pdf<
>
> http://cancerres.aacrjournals.__org/content/31/9/1173.full.pdf
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>__>
>
> **>
>
> [[alternative HTML
> version deleted]]
>
> ________________________________**_________________
> Bioc-devel at r-project.org
> <mailto:Bioc-devel at r-project.org>
> mailing list
> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>
>
>
> [[alternative HTML version
> deleted]]
>
> ________________________________**_________________
> Bioc-devel at r-project.org
> <mailto:Bioc-devel at r-project.org>
> mailing list
> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>
>
>
>
> --
> Computational Biology / Fred Hutchinson
> Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>
> ________________________________**_________________
> Bioc-devel at r-project.org
> <mailto:Bioc-devel at r-project.org> mailing list
> https://stat.ethz.ch/mailman/*__*listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel><
>
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>
>
>
>
>
>
>
>
> --
> *A model is a lie that helps you see the truth.*
> *
> *
> Howard Skipper<
>
> http://cancerres.aacrjournals.__org/content/31/9/1173.full.pdf
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>__>
>
>
> [[alternative HTML version deleted]]
>
> _________________________________________________
> Bioc-devel at r-project.org
> <mailto:Bioc-devel at r-project.org> mailing list
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>
> [[alternative HTML version deleted]]
>
> _________________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> mailing list
> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>
>
>
>
> --
> /A model is a lie that helps you see the truth./
> /
> /
> Howard Skipper
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioc-devel
mailing list