[Bioc-devel] minimizing copies when creating ExpressionSet

Wed Nov 11 14:45:44 CET 2009

Hi Martin,

Thanks for your suggestion; it is a significant help.

Unfortunately, I have an extra issue: the parser I use does not set  
dimnames as I expect. So, I have to set dimnames() and that triggers a  
copy of the matrix. So, at least for now, there isn't anything I can do.
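
A minimal sketch of how one might check whether assigning dimnames duplicates a matrix, using base R's tracemem(). The small matrix and the "f"/"s" name prefixes are hypothetical stand-ins, not the real data. Whether a copy is reported depends on the R version and on whether the object is shared, so no output is shown. The last line only restates the size of one full copy of the 6.5e6 x 70 matrix from this thread.

```r
# Small stand-in for the large expression matrix (hypothetical sizes).
m <- matrix(0, nrow = 5, ncol = 3)

# tracemem() is only available if R was built with memory profiling.
if (capabilities("profmem")) tracemem(m)

# Any duplication of 'm' from here on is reported by tracemem().
dimnames(m) <- list(paste0("f", seq_len(nrow(m))),
                    paste0("s", seq_len(ncol(m))))

if (capabilities("profmem")) untracemem(m)

# Rough size of one full copy of a 6.5e6 x 70 double matrix, in GiB
# (8 bytes per double).
copy_gib <- 6.5e6 * 70 * 8 / 2^30
```

If a tracemem line is printed at the dimnames() assignment, a copy was made at that step; copy_gib comes out at roughly 3.4, which is the cost of each such copy on the full data.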

But thank you very much for the hint,

With best wishes,

b

On Nov 7, 2009, at 11:24 PM, Martin Morgan wrote:

> Hi Benilton --
>
> I think through the 'front door' and in the current release / devel
> versions, the answer is no. The problem is that the row and column
> names of assayData, phenoData, protocolData and featureData are all
> made to be the same, and this is done by identifying the appropriate
> names and doing the assignment, e.g., the equivalent of
> colnames(assayData[["exprs"]]) <- ... But this triggers a copy of
> assayData[["exprs"]], so doubles the memory requirement.
>
> But if the row / col names are made identical ahead of time, then one
> can make some headway by building up the appropriate data components,
> including coordinating the row and column names 'up front'
>
> library(Biobase)
> assayData <- assayDataNew(exprs=matrix(0., 6.5e6, 70,
>                            dimnames=list(seq_len(6.5e6), seq_len(70))))
> phenoData <- annotatedDataFrameFrom(assayData[["exprs"]], FALSE)
> protocolData <- annotatedDataFrameFrom(assayData[["exprs"]], FALSE)
> featureData <- annotatedDataFrameFrom(assayData[["exprs"]], TRUE)
>
> and then creating and assembling the ExpressionSet one slot at a time,
> being careful to ensure that the resulting object is valid
>
> eset <- new("ExpressionSet")
> slot(eset, "assayData") <- assayData
> slot(eset, "phenoData") <- phenoData
> slot(eset, "featureData") <- featureData
> slot(eset, "protocolData") <- protocolData
>
> > validObject(eset)
> [1] TRUE
> > dim(eset)
> Features  Samples
>  6500000       70
>
> I sort of feel like this is a "rogue's game", where the user will
> fairly quickly run into the situation where they want to do something
> that triggers a copy of the large data, and then they're in trouble
> again.
>
> > eset1 <- eset[,-1]
> Error: cannot allocate vector of size 3.3 Gb
>
> Martin
>
>
> Benilton Carvalho wrote:
>> my bad... after creating either y1 or y2, resident memory used is
>> roughly 10GB (i'm counting the 'x' object here too, so i think about
>> 7GB is used to create either object).
>>
>> my question is whether there's something i'm missing that would
>> minimize the use of these 7GB...
>>
>> sorry for the typo and possibly not making myself clear.
>>
>> b
>>
>>
>> On Nov 7, 2009, at 6:11 PM, Benilton Carvalho wrote:
>>
>>> Hi,
>>>
>>> given the following:
>>>
>>>
>>> library(Biobase)
>>> x = matrix(pi, nrow=6.5e6, ncol=70)  ## ~3.4GB
>>> y1 = new("ExpressionSet", exprs=x)
>>> y2 = new("ExpressionSet",
>>>          assayData=assayDataNew("environment", exprs=x))
>>>
>>> Is there any obvious way of reducing the memory footprint when
>>> creating y1 and/or y2?  With y1, it takes me around 18GB RAM... with
>>> y2, around 10GB. Is there anything else I can do from my end to
>>> minimize this?
>>>
>>> Thanks a lot,
>>>
>>> b
>>>
>>> _______________________________________________
>>> Bioc-devel at stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793