[Bioc-devel] minimizing copies when creating ExpressionSet

Martin Morgan mtmorgan at fhcrc.org
Sun Nov 8 02:24:42 CET 2009


Hi Benilton --

I think that through the 'front door', in the current release / devel
versions, the answer is no. The problem is that the row and column names
of assayData, phenoData, protocolData and featureData are all forced to
be the same, and this is done by identifying the appropriate names and
assigning them, e.g., the equivalent of
colnames(assayData[["exprs"]]) <- ... That assignment triggers a copy of
assayData[["exprs"]], which doubles the memory requirement.

But if the row / column names are made identical ahead of time, then one
can make some headway by building up the appropriate data components,
coordinating the row and column names 'up front':

library(Biobase)
assayData <- assayDataNew(exprs=matrix(0., 6.5e6, 70,
                            dimnames=list(seq_len(6.5e6), seq_len(70))))
phenoData <- annotatedDataFrameFrom(assayData[["exprs"]], FALSE)
protocolData <- annotatedDataFrameFrom(assayData[["exprs"]], FALSE)
featureData <- annotatedDataFrameFrom(assayData[["exprs"]], TRUE)
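
A quick sanity check (just a sketch, not strictly needed) that the
pre-coordinated names really do agree:

## the dimnames of the matrix must match the row names of the
## annotated data frames (likewise for protocolData)
stopifnot(identical(colnames(assayData[["exprs"]]), rownames(pData(phenoData))),
          identical(rownames(assayData[["exprs"]]), rownames(pData(featureData))))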

and then creating and assembling the ExpressionSet one slot at a time,
being careful to ensure that the resulting object is valid:

eset <- new("ExpressionSet")
slot(eset, "assayData") <- assayData
slot(eset, "phenoData") <- phenoData
slot(eset, "featureData") <- featureData
slot(eset, "protocolData") <- protocolData

> validObject(eset)
[1] TRUE
> dim(eset)
Features  Samples
 6500000       70

I sort of feel like this is a "rogue's game", though: the user will
fairly quickly want to do something that triggers a copy of the large
data, and then they're in trouble again.

> eset1 <- eset[,-1]
Error: cannot allocate vector of size 3.3 Gb

Martin


Benilton Carvalho wrote:
> My bad... after creating either y1 or y2, resident memory use is
> roughly 10GB (I'm counting the 'x' object too, so I think about 7GB
> is used to create either object).
> 
> My question is whether there's something I'm missing that would minimize
> the use of these 7GB.
> 
> Sorry for the typo and for possibly not making myself clear.
> 
> b
> 
> 
> On Nov 7, 2009, at 6:11 PM, Benilton Carvalho wrote:
> 
>> Hi,
>>
>> given the following:
>>
>>
>> library(Biobase)
>> x = matrix(pi, nr=6.5e6, nc=70)  ##3.4GB
>> y1 = new("ExpressionSet", exprs=x)
>> y2 = new("ExpressionSet", assayData=assayDataNew("environment",
>> exprs=x))
>>
>> Is there any obvious way of reducing the memory footprint when
>> creating y1 and/or y2?  With y1, it takes me around 18GB RAM... with
>> y2, around 10GB. Is there anything else I can do from my end to
>> minimize this?
>>
>> Thanks a lot,
>>
>> b
>>


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


