[BioC] Hi memory usage...problems

Thu Oct 30 21:49:05 MET 2003

We are trying to load around 260 CEL files into an AffyBatch object. There are 6117 genes on the chip.
We tried on both Windows and Linux machines. The problem is that R cannot allocate enough memory to load this data.

On Windows with 512 Mb of RAM and 4 Gb of virtual memory, we can adjust the amount of memory available to R, but once the R process reaches around 1.6 Gb it simply never finishes, or hangs (this is a known problem, according to the R documentation).

On Linux with 2 Gb of RAM and 4 Gb of virtual memory, we encountered the error "cannot allocate vector" for a vector of around 470 Mb.
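
For a rough sense of scale: a numeric matrix in R stores 8 bytes per value, so the intensity matrix alone needs (cells per chip) x (number of chips) x 8 bytes. The sketch below is only an illustration. The helper name `estimate_matrix_mb` and the 512 x 512 chip layout are assumptions made for the example, not values taken from our data.

```r
# Rough size of a numeric (double) matrix holding all CEL intensities.
# n_cells: probe cells per chip (rows * cols); n_chips: number of CEL files.
estimate_matrix_mb <- function(n_cells, n_chips) {
  n_cells * n_chips * 8 / 2^20   # 8 bytes per double, reported in MiB
}

# Hypothetical 512 x 512 chip layout and 260 CEL files
size_mb <- estimate_matrix_mb(512 * 512, 260)
cat("one copy of the matrix:", size_mb, "MiB\n")
cat("two copies at once:    ", 2 * size_mb, "MiB\n")
```

Any temporary copy made while the object is being built adds the same amount again on top of the final object.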

We successfully loaded the data using some workarounds, but I think the read.affybatch routine could be enhanced, and/or the AffyBatch object modified in some way, to resolve this kind of problem.

We analyzed the read.affybatch routine and saw that:
1) An AffyBatch object is created before all the CEL files are read.
2) All CEL files are read and saved in a temporary matrix.
3) The temporary matrix is copied (or assigned?) to the AffyBatch object.

We think that the three steps above consume more memory than necessary.
We think that if the AffyBatch object had a method to load or replace just one column of the matrix (that is, one CEL file), the amount of memory needed to load all the data would be significantly lower. This is because, in the current read.affybatch process, the memory needed is twice what the final AffyBatch object actually consumes.
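
The idea of filling one column per CEL file can be sketched in base R as follows. This is only an illustration on a plain matrix, not on an AffyBatch: `read_one_chip` is a made-up stand-in for reading a CEL file, and the sizes are toy values.

```r
# Stand-in for reading one CEL file: returns one intensity vector.
# (Hypothetical helper; a real version would call read.celfile.)
read_one_chip <- function(i, n_cells) {
  as.numeric(seq_len(n_cells)) + i
}

n_cells <- 1000L  # toy number of probe cells per chip
n_chips <- 5L     # toy number of CEL files

# Allocate the full matrix once, then overwrite one column per file,
# so no second full-size matrix is ever built.
ival <- matrix(0, nrow = n_cells, ncol = n_chips)
for (i in seq_len(n_chips)) {
  ival[, i] <- read_one_chip(i, n_cells)
}

print(dim(ival))
```

In principle, with this pattern only one extra chip-sized vector needs to exist at any time on top of the final matrix.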

We could not load all 260 CEL files, but 210 were successfully loaded following these steps:
1) Use the routine called "vivo.read.affybatch" to load the data into a matrix (see below).
2) Save the matrix from step 1.
3) In a new R session, load the data saved in step 2 and use it to create an AffyBatch object.
4) Save the AffyBatch object from step 3.
5) In a new R session, load the AffyBatch saved in step 4.
6) Proceed with the analysis.

The memory used after step 5 was half the memory used after step 4.
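
A likely explanation (we have not verified it) is that a fresh session holds only the final object, with none of the temporary copies left over from building it. The save / clean up / load pattern, together with gc() to check memory in use, can be sketched in base R like this. The matrix and the temporary file are toy stand-ins, not our data.

```r
# Toy matrix standing in for the intensity data (sizes are arbitrary)
values <- matrix(runif(1e5), nrow = 1000)
checksum <- sum(values)

f <- tempfile(fileext = ".RData")   # temporary file for this sketch
save(values, file = f, compress = TRUE)

rm(values)                    # drop the object, as ending the session would
mb_before <- sum(gc()[, 2])   # Mb in use (Ncells + Vcells) after cleanup

load(f)                       # restores `values` from disk
mb_after <- sum(gc()[, 2])

cat("Mb in use before load:", mb_before, " after load:", mb_after, "\n")
```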

Any comments are welcome,

Regards,
Victor Trevino-Alvarado
vmt359 at bham.ac.uk

# for reference, this is closer to a "copy-paste version" of the read.affybatch routine

library(affy)
vivo.read.affybatch <- function (..., filenames = character(0),
    phenoData = new("phenoData"), description = NULL, notes = "",
    compress = getOption("BioC")$affy$compress.cel,
    rm.mask = FALSE, rm.outliers = FALSE, rm.extra = FALSE,
    verbose = FALSE)
{
    auxnames <- as.list(substitute(list(...)))[-1]
    filenames <- .Primitive("c")(filenames, auxnames)
    n <- length(filenames)
    if (n == 0)
        stop("No file name given !")
    pdata <- pData(phenoData)
    if (dim(pdata)[1] != n) {
        warning("Incompatible phenoData object. Created a new one.\n")
        samplenames <- sub("^/?([^/]*/)*", "", unlist(filenames),
            extended = TRUE)
        pdata <- data.frame(sample = 1:n, row.names = samplenames)
        phenoData <- new("phenoData", pData = pdata,
            varLabels = list(sample = "arbitrary numbering"))
    }
    else samplenames <- rownames(pdata)
    if (is.null(description)) {
        description <- new("MIAME")
        description@preprocessing$filenames <- filenames
        description@preprocessing$affyversion <-
            library(help = affy)$info[[2]][[2]][2]
    }
    if (verbose)
        cat(1, "reading", filenames[[1]], "...")
    cel <- read.celfile(filenames[[1]], compress = compress,
        rm.mask = rm.mask, rm.outliers = rm.outliers, rm.extra = rm.extra)
    if (verbose)
        cat("done.\n")
    dim.intensity <- dim(intensity(cel))
    ref.cdfName <- cel@cdfName
    if (verbose) cat("Instantiating the array...")
    # One row per probe cell, one column per CEL file
    ival <- array(0, dim = c(prod(dim.intensity), n), dimnames = list(NULL, samplenames))
    if (verbose) cat("done!\n")
    ival[, 1] <- c(intensity(cel))
    for (i in (1:n)[-1]) {
        if (verbose)
            cat(i, "reading", filenames[[i]], "...")
        cel <- read.celfile(filenames[[i]], compress = compress,
            rm.mask = rm.mask, rm.outliers = rm.outliers,
            rm.extra = rm.extra)
        if (any(dim(intensity(cel)) != dim.intensity))
            stop(paste("CEL file dimension mismatch !\n(file",
                filenames[[i]], ")"))
        if (verbose)
            cat("done.\n")
        if (cel@cdfName != ref.cdfName)
            warning(paste("\n***\nDetected a mismatch of the cdfName: found ",
                cel@cdfName, ", expected ", ref.cdfName, "\nin file number ",
                i, " (", filenames[[i]], ")\n",
                "Please make sure all cel files belong to the same chip type!\n***\n",
                sep = ""))
        # Overwrite only column i with this file's intensities
        ival[, i] <- c(intensity(cel))
    }
    if (verbose)
        cat(paste("returning the intensities as a ",
            prod(dim.intensity), "x", length(filenames), " matrix\n",
            sep = ""))
    # Unlike read.affybatch, return the raw matrix instead of an AffyBatch
    return(ival)
}

#-------------- Step 1 & 2 (new session)
values <- vivo.read.affybatch(<your typical parameters>)
save(values, file="values.RData", compress=TRUE)

#------------- Step 3 & 4 (new session)
load("values.RData")
ab <- new("AffyBatch", exprs = values, cdfName = "your cdf",
        phenoData = <a new phenoData>,
        nrow = dim(values)[1], ncol = dim(values)[2])
save(ab, file="ab.RData", compress=TRUE)

#------------- Step 5 (new session)
load("ab.RData")

# Proceed with your analysis