[Bioc-devel] strange behavior on memory usage (fwd)

Wed Aug 24 02:43:23 CEST 2005

On Mon, 22 Aug 2005, Vincent Carey 525-2265 wrote:

>> hi everyone,
>>
>> i was wondering if anybody could give me a hint of what causes a strange
>> behavior on memory usage when using oligo/makePlatformDesign packages.
>>
>> i'm reading a bunch of (affy) SNP chips:
>>
>>> x = read.celfiles(list.celfiles())
>>      -> at this point the R process uses around 2GB
>>      -> which does not look bad, since i'm reading 90 samples
>>> show(x)
>>      -> now the R process uses around 6GB
>>      -> how can i improve the code so it does not use so much memory?
>>      -> the information i'm using at this step comes basically from
>>      ->       dim(getExpData(x, "exprs"))
>
> I have not tried to reproduce this yet for lack of time.  But it
> seems to me that the principle we need to establish here is:
> for any massive data structure, we need to put relevant metadata in slots,
> and interrogate only those slots.  I don't know what dim() or getExpData()
> are doing, but my guess is that they are making some copies of something
> that they shouldn't need.  you mention an issue with str() also -- now
> perhaps we need to write an oligobatch method for str that doesn't
> poke around too much?  not sure
>
> Let's put the necessary dimension data in slots and be sure to update
> those slots whenever subsetting is done.  And anything that show() needs
> should likewise be available without doing anything to the potentially
> massive datastructures.
>
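
A minimal sketch of the "metadata in slots" idea described above, in case it helps the discussion. The class name BigBatch, its dims slot, and the constructor are hypothetical and are not the real oligoBatch definition; the point is only that show() reads a small cached slot and that subsetting refreshes it.

```r
library(methods)

# Hypothetical class, NOT the real oligoBatch: a matrix plus cached dimensions.
setClass("BigBatch",
         representation(exprs = "matrix", dims = "integer"))

# Constructor records the dimensions once, when the data are loaded.
BigBatch <- function(m) {
  new("BigBatch", exprs = m, dims = dim(m))
}

# show() reads only the small 'dims' slot and never touches 'exprs'.
setMethod("show", "BigBatch", function(object) {
  cat("BigBatch with", object@dims[1], "features and",
      object@dims[2], "samples\n")
})

# Subsetting rebuilds the object so the cached dimensions stay in sync.
# This sketch assumes both i and j are supplied.
setMethod("[", "BigBatch", function(x, i, j, ..., drop = FALSE) {
  m <- x@exprs[i, j, drop = FALSE]
  new("BigBatch", exprs = m, dims = dim(m))
})

b <- BigBatch(matrix(rnorm(20), nrow = 5, ncol = 4))
b             # uses the cached dims
b[1:3, 1:2]   # dims slot is updated by the "[" method
```

Whether this removes the extra copies seen with show() on the real object would still need to be measured; the sketch only shows where the dimension information could live.
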
> A couple of other points:
> 1) I noticed that a pdmapping environment has X and Y as vectors of integers.
> These are pretty big.  Is it possible to use i2xy and xy2i software to get
> rid of these completely?  these functions can be put into the environment,
> and the necessary offsets can be updated whenever a subset is done using
> a closure construct

the problem here is nimblegen. it's hard to explain, but there is no
mapping from x,y to index.
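
for platforms that do sit on a regular grid, the closure idea could look roughly like the sketch below. The helper name makeXYMapper and the indexing convention (0-based x and y, 1-based index, index = y * ncol + x + 1) are assumptions made for illustration, not the convention of any existing i2xy/xy2i code, and the 1600-column layout is only a guess consistent with the 2,560,000 rows shown in str(x). As noted above, this would not cover nimblegen.

```r
# Hypothetical helper: builds index <-> coordinate converters for a
# regular grid with 'ncol' columns. Assumed convention: x and y are
# 0-based, indices are 1-based, and index = y * ncol + x + 1.
makeXYMapper <- function(ncol) {
  ncol <- as.integer(ncol)
  list(
    # (x, y) -> linear index; 'ncol' is captured by the closure
    xy2i = function(x, y) as.integer(y * ncol + x + 1L),
    # linear index -> two-column matrix of (x, y)
    i2xy = function(i) {
      i0 <- as.integer(i) - 1L
      cbind(x = i0 %% ncol, y = i0 %/% ncol)
    }
  )
}

map <- makeXYMapper(1600)   # assumed square layout: 1600 * 1600 cells
idx <- map$xy2i(c(0, 5, 1599), c(0, 2, 1599))
map$i2xy(idx)               # recovers the original coordinates
```

The two functions replace the stored X and Y vectors with a single captured integer, which is where the memory saving would come from.
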

> 2) installed package footprints with large .rda structures can be enormous, approaching
> 1GB.  We can use save(...,compress=TRUE) to reduce the installed footprint
> and the usage overhead at load time seems quite acceptable.  I got the
> pdmapping50khind240.rda down from 440MB to 60MB with this method.  I understand
> that compress=TRUE has no impact on the compressed preinstallation package size.
> I am concerned about postinstall footprints.
>

how much longer does it take to load though? my thinking was that i'd rather
have a large object and faster loads.
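
one way to answer that would be to measure both on the same object. The sketch below uses a small toy list as a stand-in (it is not the real pdmapping data, and real timings will differ), saves it with and without compression, and compares file size and load time.

```r
# Toy stand-in for a large annotation object (NOT the real pdmapping data).
toy <- list(X = rep(0:999, times = 1000),
            Y = rep(0:999, each = 1000))

f_raw <- tempfile(fileext = ".rda")
f_gz  <- tempfile(fileext = ".rda")
save(toy, file = f_raw, compress = FALSE)
save(toy, file = f_gz,  compress = TRUE)

# On-disk footprint of each file, in bytes.
sizes <- file.info(c(f_raw, f_gz))$size
names(sizes) <- c("uncompressed", "compressed")
print(sizes)

# Load time for each file; load() detects the compression by itself.
# Loading into a scratch environment avoids overwriting 'toy'.
print(system.time(load(f_raw, envir = new.env())))
print(system.time(load(f_gz,  envir = new.env())))
```

Running the same comparison on the actual .rda file would show whether the slower load is noticeable next to the smaller installed footprint.
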

>>> gc()
>>      -> back to 2GB
>>
>> in the above, 'x' is an oligoBatch object (which contains eSet, details at the
>> end of this message).
>>
>> any suggestion?
>>
>> thanks a lot,
>>
>> benilton
>>
>> ps: i just noticed that using dim(exprs(x)) in show() reduces the memory usage
>> from 6GB to 3.5GB... and using str(x) increases it to 10.5GB.
>>
>> -----------------------------------------------------------------------------
>> R version 2.2.0, 2005-07-26, x86_64-unknown-linux-gnu
>>
>> attached base packages:
>> [1] "tools"     "methods"   "stats"     "graphics"  "grDevices" "utils"
>> [7] "datasets"  "base"
>>
>> other attached packages:
>>       oligo reposTools    Biobase
>>     "0.0.7"    "1.6.0"    "1.6.6"
>> -------------------------------------------------------------------------------
>>
>>> str(x)
>> Formal class 'oligoBatch' [package "oligo"] with 8 slots
>>    ..@ manufacturer: chr "Affymetrix"
>>    ..@ platform    : chr "Mapping50K_Hind240"
>>    ..@ eList       :Formal class 'exprList' [package "Biobase"] with 2 slots
>>    .. .. ..@ eMetadata:`data.frame':     0 obs. of  0 variables
>>    .. .. ..@ eList    :List of 1
>>    .. .. .. ..$ exprs: num [1:2560000, 1:90]  1369 65472  ...
>>    .. .. .. .. ..- attr(*, "dimnames")=List of 2
>>    .. .. .. .. .. ..$ : NULL
>>    .. .. .. .. .. ..$ : chr [1:90] "NA06985_Hind_B5_3005533.CEL" ...
>>    ..@ description :Formal class 'MIAME' [package "Biobase"] with 11 slots
>>    .. .. ..@ name          : chr ""
>>    .. .. ..@ lab           : chr ""
>>    .. .. ..@ contact       : chr ""
>>    .. .. ..@ title         : chr ""
>>    .. .. ..@ abstract      : chr ""
>>    .. .. ..@ url           : chr ""
>>    .. .. ..@ samples       : list()
>>    .. .. ..@ hybridizations: list()
>>    .. .. ..@ normControls  : list()
>>    .. .. ..@ preprocessing :List of 2
>>    .. .. .. ..$ filenames   : chr [1:90] "NA06985_Hind_B5_3005533.CEL" ...
>>    .. .. .. ..$ oligoversion: chr NA
>>    .. .. ..@ other         : list()
>>    ..@ annotation  : chr ""
>>    ..@ sampleNames : chr [1:90] "NA06985_Hind_B5_3005533.CEL" ...
>>    ..@ notes       : chr ""
>>    ..@ phenoData   :Formal class 'phenoData' [package "Biobase"] with 3 slots
>>    .. .. ..@ pData      :`data.frame':   90 obs. of  1 variable:
>>    .. .. .. ..$ sample: int [1:90] 1 2 3 4 5 6 7 8 9 10 ...
>>    .. .. ..@ varLabels  :List of 1
>>    .. .. .. ..$ sample: chr "arbitrary numbering"
>>    .. .. ..@ varMetadata:`data.frame':   0 obs. of  0 variables
>>
>> _______________________________________________
>> Bioc-devel at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>