[Bioc-devel] strange behavior on memory usage
Wolfgang Huber
huber at ebi.ac.uk
Tue Aug 23 23:47:41 CEST 2005
Hi Vince, et al.
It seems to me the problem is bigger than just fixing the "show" method
and caching (duplicating) e.g. the dimension information in extra slots.
I am a bit worried that if "getExpData" is such a memory hog, the whole
eSet class becomes much less useful, and people might be tempted to
revert to using simple matrices for performance-critical
computations. Is there a better way to do this that avoids such overhead
in "getExpData" in the first place? (I guess we might need somebody
who understands the memory management in R and perhaps can even write
some of the necessary infrastructure in C.)
What I don't understand in Benilton's email (one of many things) is
this: "ps: i just noticed that using dim(exprs(x)) in show() reduces the
memory usage from 6GB to 3.5GB... " But the implementation of exprs() is
setMethod("exprs", "eSet",
function(object) getExpData(object, "exprs")
)
i.e. it just calls getExpData:
setMethod("getExpData", c("eSet", "character"),
function(object, name) {
object@eList[[name]] })
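
One way to see where the memory goes would be to measure the peak
allocation of each variant directly. What follows is only a sketch on a
toy class: "toySet" and "peak_increase_mb" are made-up names, not part of
Biobase or oligo, and the matrix is far smaller than the real one. It
relies on gc(reset = TRUE), which resets the "max used" statistics that
gc() reports.

```r
library(methods)

# Toy stand-in for eSet (hypothetical; the real class has more slots).
setClass("toySet", representation(eList = "list"))

setGeneric("getExpData",
           function(object, name) standardGeneric("getExpData"))
setMethod("getExpData", c("toySet", "character"),
          function(object, name) object@eList[[name]])

# Hypothetical helper: peak growth of the vector heap (in Mb) while
# 'expr' is evaluated, relative to the usage just before evaluation.
peak_increase_mb <- function(expr) {
  before <- gc(reset = TRUE)   # collect, and reset the "max used" columns
  force(expr)                  # the argument is lazy, so it is evaluated here
  after <- gc()
  # The last column of gc() is "max used" in Mb; column 2 is "used" in Mb.
  after["Vcells", ncol(after)] - before["Vcells", 2]
}

x <- new("toySet", eList = list(exprs = matrix(rnorm(2e5 * 9), ncol = 9)))

d    <- dim(getExpData(x, "exprs"))
peak <- peak_increase_mb(dim(getExpData(x, "exprs")))
print(d)
print(peak)
```

Running the same helper on show(x), dim(exprs(x)) and
dim(getExpData(x, "exprs")) for the real oligoBatch object should show
which of the calls is responsible for the extra gigabytes.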
Best,
Wolfgang
Vincent Carey 525-2265 wrote:
>>hi everyone,
>>
>>i was wondering if anybody could give me a hint of what causes a strange
>>behavior on memory usage when using oligo/makePlatformDesign packages.
>>
>>i'm reading a bunch of (affy) SNP chips:
>>
>>
>>>x = read.celfiles(list.celfiles())
>>
>> -> at this point the R process uses around 2GB
>> -> which does not look bad, since i'm reading 90 samples
>>
>>>show(x)
>>
>> -> now the R process uses around 6GB
>> -> how can i improve the code so it does not use so much memory?
>> -> the information i'm using at this step comes basically from
>> -> dim(getExpData(x, "exprs"))
>
>
> I have not tried to reproduce this yet for lack of time. But it
> seems to me that the principle we need to establish here is:
> for any massive data structure, we need to put relevant metadata in slots,
> and interrogate only those slots. I don't know what dim() or getExpData()
> are doing, but my guess is that they are making some copies of something
> that they shouldn't need. you mention an issue with str() also -- now
> perhaps we need to write an oligobatch method for str that doesn't
> poke around too much? not sure
>
> Let's put the necessary dimension data in slots and be sure to update
> those slots whenever subsetting is done. And anything that show() needs
> should likewise be available without doing anything to the potentially
> massive datastructures.
>
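
The slot-caching idea above could look roughly like the following. This
is a minimal sketch on a made-up class ("cachedSet" is hypothetical and
is not the oligoBatch definition): the dimensions are stored in their own
slot, show() reads only that slot, and the "[" method refreshes it
whenever a subset is taken.

```r
library(methods)

# Hypothetical class: the data matrix plus a cached copy of its dimensions.
setClass("cachedSet", representation(data = "matrix", dims = "integer"))

# Constructor that fills the cache once, when the object is created.
cachedSet <- function(m) new("cachedSet", data = m, dims = dim(m))

# show() only looks at the small 'dims' slot, never at 'data'.
setMethod("show", "cachedSet", function(object) {
  cat("cachedSet:", object@dims[1], "features,", object@dims[2], "samples\n")
})

# Subsetting rebuilds the object so that the cached dimensions stay in sync.
setMethod("[", "cachedSet", function(x, i, j, ..., drop = FALSE) {
  if (missing(i)) i <- seq_len(x@dims[1])
  if (missing(j)) j <- seq_len(x@dims[2])
  sub <- x@data[i, j, drop = FALSE]
  new("cachedSet", data = sub, dims = dim(sub))
})

big   <- cachedSet(matrix(0, nrow = 1000, ncol = 90))
small <- big[1:10, 1:3]
show(big)
show(small)
```

The price is the duplication mentioned at the top of this message: every
method that modifies the data has to remember to update the cache.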
> A couple of other points:
> 1) I noticed that a pdmapping environment has X and Y as vectors of integers.
> These are pretty big. Is it possible to use i2xy and xy2i software to get
> rid of these completely? these functions can be put into the environment,
> and the necessary offsets can be updated whenever a subset is done using
> a closure construct
> 2) installed package footprints with large .rda structures can be enormous, approaching
> 1GB. We can use save(...,compress=TRUE) to reduce the installed footprint
> and the usage overhead at load time seems quite acceptable. I got the
> pdmapping50khind240.rda down from 440MB to 60MB with this method. I understand
> that compress=TRUE has no impact on the compressed preinstallation package size.
> I am concerned about postinstall footprints.
>
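
For point 1), the closure construct could be sketched as below. The names
"makeIndexMap", "xy2i" and "i2xy" as used here are illustrative only and
are not the functions from any existing package; the sketch assumes
0-based x and y, 1-based cell indices, and a row-major layout. The width
of 1600 is a guess based on 1600 * 1600 = 2560000, the row count in the
str() output; the real chip geometry may differ.

```r
# Hypothetical factory: 'ncol' lives in the closure, so no X and Y
# vectors need to be stored.
makeIndexMap <- function(ncol) {
  force(ncol)
  list(
    # (x, y) -> cell index
    xy2i = function(x, y) y * ncol + x + 1L,
    # cell index -> two-column matrix of (x, y)
    i2xy = function(i) cbind(x = (i - 1L) %% ncol, y = (i - 1L) %/% ncol)
  )
}

map <- makeIndexMap(1600L)

i  <- map$xy2i(x = 5L, y = 2L)
xy <- map$i2xy(i)
print(i)
print(xy)
```

Handling subsets would need an additional offset or index vector kept in
the closure, which this sketch leaves out.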
>
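
The effect of compress=TRUE in point 2) is easy to check on a small
example. The data below are made up and highly repetitive, so the ratio
says nothing about what a real pdmapping object would achieve; the sketch
only shows the mechanics. Note that current versions of R compress by
default in save().

```r
# Highly repetitive toy data (an assumption; real annotation objects
# will compress differently).
toy <- rep(1:1000, times = 1000)

plain  <- tempfile(fileext = ".rda")
packed <- tempfile(fileext = ".rda")

save(toy, file = plain,  compress = FALSE)
save(toy, file = packed, compress = TRUE)   # gzip compression

sizes <- file.size(c(plain, packed))        # sizes in bytes
print(sizes)

# Loading is transparent: load() detects the compression by itself.
env <- new.env()
load(packed, envir = env)
same <- identical(env$toy, toy)
print(same)
```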
>>>gc()
>>
>> -> back to 2GB
>>
>>in the above, 'x' is an oligoBatch object (which contains eSet, details at the
>>end of this message).
>>
>>any suggestion?
>>
>>thanks a lot,
>>
>>benilton
>>
>>ps: i just noticed that using dim(exprs(x)) in show() reduces the memory usage
>>from 6GB to 3.5GB... and using str(x) increases it to 10.5GB.
>>
>>-----------------------------------------------------------------------------
>>R version 2.2.0, 2005-07-26, x86_64-unknown-linux-gnu
>>
>>attached base packages:
>>[1] "tools" "methods" "stats" "graphics" "grDevices" "utils"
>>[7] "datasets" "base"
>>
>>other attached packages:
>> oligo reposTools Biobase
>> "0.0.7" "1.6.0" "1.6.6"
>>-------------------------------------------------------------------------------
>>
>>
>>>str(x)
>>
>>Formal class 'oligoBatch' [package "oligo"] with 8 slots
>> ..@ manufacturer: chr "Affymetrix"
>> ..@ platform : chr "Mapping50K_Hind240"
>> ..@ eList :Formal class 'exprList' [package "Biobase"] with 2 slots
>> .. .. ..@ eMetadata:`data.frame': 0 obs. of 0 variables
>> .. .. ..@ eList :List of 1
>> .. .. .. ..$ exprs: num [1:2560000, 1:90] 1369 65472 ...
>> .. .. .. .. ..- attr(*, "dimnames")=List of 2
>> .. .. .. .. .. ..$ : NULL
>> .. .. .. .. .. ..$ : chr [1:90] "NA06985_Hind_B5_3005533.CEL" ...
>> ..@ description :Formal class 'MIAME' [package "Biobase"] with 11 slots
>> .. .. ..@ name : chr ""
>> .. .. ..@ lab : chr ""
>> .. .. ..@ contact : chr ""
>> .. .. ..@ title : chr ""
>> .. .. ..@ abstract : chr ""
>> .. .. ..@ url : chr ""
>> .. .. ..@ samples : list()
>> .. .. ..@ hybridizations: list()
>> .. .. ..@ normControls : list()
>> .. .. ..@ preprocessing :List of 2
>> .. .. .. ..$ filenames : chr [1:90] "NA06985_Hind_B5_3005533.CEL" ...
>> .. .. .. ..$ oligoversion: chr NA
>> .. .. ..@ other : list()
>> ..@ annotation : chr ""
>> ..@ sampleNames : chr [1:90] "NA06985_Hind_B5_3005533.CEL" ...
>> ..@ notes : chr ""
>> ..@ phenoData :Formal class 'phenoData' [package "Biobase"] with 3 slots
>> .. .. ..@ pData :`data.frame': 90 obs. of 1 variable:
>> .. .. .. ..$ sample: int [1:90] 1 2 3 4 5 6 7 8 9 10 ...
>> .. .. ..@ varLabels :List of 1
>> .. .. .. ..$ sample: chr "arbitrary numbering"
>> .. .. ..@ varMetadata:`data.frame': 0 obs. of 0 variables
>>
--
Best regards
Wolfgang
-------------------------------------
Wolfgang Huber
European Bioinformatics Institute
European Molecular Biology Laboratory
Cambridge CB10 1SD
England
Phone: +44 1223 494642
Fax: +44 1223 494486
Http: www.ebi.ac.uk/huber