[Bioc-devel] strange behavior on memory usage (fwd)

Mon Aug 22 19:39:14 CEST 2005

> hi everyone,
>
> i was wondering if anybody could give me a hint of what causes a strange
> behavior on memory usage when using oligo/makePlatformDesign packages.
>
> i'm reading a bunch of (affy) SNP chips:
>
> > x = read.celfiles(list.celfiles())
>      -> at this point the R process uses around 2GB
>      -> which does not look bad, since i'm reading 90 samples
> > show(x)
>      -> now the R process uses around 6GB
>      -> how can i improve the code so it does not use so much memory?
>      -> the information i'm using at this step comes basically from
>      ->       dim(getExpData(x, "exprs"))

I have not yet tried to reproduce this, for lack of time.  But it
seems to me that the principle we need to establish here is:
for any massive data structure, put the relevant metadata in slots
and interrogate only those slots.  I don't know what dim() or getExpData()
are doing, but my guess is that they are making copies of something
they shouldn't need to copy.  You mention an issue with str() as well;
perhaps we need to write an oligoBatch method for str() that doesn't
poke around too much?  Not sure.

Let's put the necessary dimension data in slots and be sure to update
those slots whenever subsetting is done.  Anything that show() needs
should likewise be available without touching the potentially
massive data structures.
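
A minimal sketch of the idea, under assumptions: BigBatch, its dims slot
and the constructor are hypothetical stand-ins and are not the actual
oligoBatch definition.  The dimensions are recorded once at construction,
dim() and show() read only that small slot, and "[" refreshes it.

```r
library(methods)

# Hypothetical stand-in for a large container; not the real oligoBatch class.
setClass("BigBatch",
         representation(exprs = "matrix",    # the potentially massive payload
                        dims  = "integer"))  # cached c(nrow, ncol)

# Constructor: record the dimensions once, when the object is built.
BigBatch <- function(m) {
  new("BigBatch", exprs = m, dims = dim(m))
}

# dim() and show() consult only the small cached slot.
setMethod("dim", "BigBatch", function(x) x@dims)

setMethod("show", "BigBatch", function(object) {
  cat("BigBatch with", object@dims[1], "features and",
      object@dims[2], "samples\n")
})

# Subsetting rebuilds the object, so the cached dimensions stay in sync.
setMethod("[", "BigBatch", function(x, i, j, ..., drop = FALSE) {
  m <- x@exprs[i, j, drop = FALSE]
  new("BigBatch", exprs = m, dims = dim(m))
})

b <- BigBatch(matrix(rnorm(1000 * 9), nrow = 1000, ncol = 9))
b                      # printed via show(), which reads only the dims slot
s <- b[1:100, 1:3]
dim(s)                 # taken from the refreshed slot
```

Whether this removes the copies seen above depends on what getExpData()
does internally, which I have not checked.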

A couple of other points:
1) I noticed that a pdmapping environment has X and Y as vectors of integers.
These are pretty big.  Is it possible to use i2xy- and xy2i-style functions to
get rid of them completely?  The functions can be put into the environment,
and the necessary offsets can be updated with a closure construct whenever a
subset is taken.
2) Installed package footprints with large .rda structures can be enormous,
approaching 1GB.  We can use save(..., compress=TRUE) to reduce the installed
footprint, and the usage overhead at load time seems quite acceptable.  I got
pdmapping50khind240.rda down from 440MB to 60MB with this method.  I understand
that compress=TRUE has no impact on the size of the compressed preinstallation
package; my concern is the postinstall footprint.
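
Both points can be sketched roughly as follows.  Everything named here is an
assumption for illustration: make_index_map() is a hypothetical helper, the
layout is taken to be row-major with zero-based (x, y) coordinates and
one-based linear indices, and the vector "big" is only a stand-in for a real
pdmapping object.

```r
# Part 1: index arithmetic in a closure instead of stored X and Y vectors.
# ncol is the width of the (sub)grid; the offsets locate its corner on the chip.
make_index_map <- function(ncol, x_offset = 0L, y_offset = 0L) {
  list(
    # zero-based (x, y) -> one-based linear index within the (sub)grid
    xy2i = function(x, y) (y - y_offset) * ncol + (x - x_offset) + 1L,
    # one-based linear index -> zero-based (x, y)
    i2xy = function(i) {
      cbind(x = (i - 1L) %% ncol + x_offset,
            y = (i - 1L) %/% ncol + y_offset)
    }
  )
}

full <- make_index_map(ncol = 1600L)        # 1600 * 1600 = 2560000 cells
idx  <- full$xy2i(5L, 2L)
full$i2xy(idx)

# After taking a rectangular subset, build a new map with updated offsets
# rather than subsetting stored coordinate vectors.
sub <- make_index_map(ncol = 100L, x_offset = 10L, y_offset = 20L)
sub$i2xy(1L)                                # first cell sits at the offsets

# Part 2: compare on-disk size with and without compression.
big   <- rep(0:1599, times = 100)           # repetitive integers, like X
f_raw <- tempfile(fileext = ".rda")
f_gz  <- tempfile(fileext = ".rda")
save(big, file = f_raw, compress = FALSE)
save(big, file = f_gz,  compress = TRUE)
c(raw = file.info(f_raw)$size, compressed = file.info(f_gz)$size)
```

The closure keeps only three integers per map, so its cost does not grow with
the number of features.  The size ratio from the second part depends entirely
on how repetitive the saved data are, so it should not be read as a prediction
for the real .rda files.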

> > gc()
>      -> back to 2GB
>
> in the above, 'x' is an oligoBatch object (which contains eSet, details at the
> end of this message).
>
> any suggestion?
>
> thanks a lot,
>
> benilton
>
> ps: i just noticed that using dim(exprs(x)) in show() reduces the memory usage
> from 6GB to 3.5GB... and using str(x) increases it to 10.5GB.
>
> -----------------------------------------------------------------------------
> R version 2.2.0, 2005-07-26, x86_64-unknown-linux-gnu
>
> attached base packages:
> [1] "tools"     "methods"   "stats"     "graphics"  "grDevices" "utils"
> [7] "datasets"  "base"
>
> other attached packages:
>       oligo reposTools    Biobase
>     "0.0.7"    "1.6.0"    "1.6.6"
> -------------------------------------------------------------------------------
>
> > str(x)
> Formal class 'oligoBatch' [package "oligo"] with 8 slots
>    ..@ manufacturer: chr "Affymetrix"
>    ..@ platform    : chr "Mapping50K_Hind240"
>    ..@ eList       :Formal class 'exprList' [package "Biobase"] with 2 slots
>    .. .. ..@ eMetadata:`data.frame':     0 obs. of  0 variables
>    .. .. ..@ eList    :List of 1
>    .. .. .. ..$ exprs: num [1:2560000, 1:90]  1369 65472  ...
>    .. .. .. .. ..- attr(*, "dimnames")=List of 2
>    .. .. .. .. .. ..$ : NULL
>    .. .. .. .. .. ..$ : chr [1:90] "NA06985_Hind_B5_3005533.CEL" ...
>    ..@ description :Formal class 'MIAME' [package "Biobase"] with 11 slots
>    .. .. ..@ name          : chr ""
>    .. .. ..@ lab           : chr ""
>    .. .. ..@ contact       : chr ""
>    .. .. ..@ title         : chr ""
>    .. .. ..@ abstract      : chr ""
>    .. .. ..@ url           : chr ""
>    .. .. ..@ samples       : list()
>    .. .. ..@ hybridizations: list()
>    .. .. ..@ normControls  : list()
>    .. .. ..@ preprocessing :List of 2
>    .. .. .. ..$ filenames   : chr [1:90] "NA06985_Hind_B5_3005533.CEL" ...
>    .. .. .. ..$ oligoversion: chr NA
>    .. .. ..@ other         : list()
>    ..@ annotation  : chr ""
>    ..@ sampleNames : chr [1:90] "NA06985_Hind_B5_3005533.CEL" ...
>    ..@ notes       : chr ""
>    ..@ phenoData   :Formal class 'phenoData' [package "Biobase"] with 3 slots
>    .. .. ..@ pData      :`data.frame':   90 obs. of  1 variable:
>    .. .. .. ..$ sample: int [1:90] 1 2 3 4 5 6 7 8 9 10 ...
>    .. .. ..@ varLabels  :List of 1
>    .. .. .. ..$ sample: chr "arbitrary numbering"
>    .. .. ..@ varMetadata:`data.frame':   0 obs. of  0 variables
>