[R] help with memory greedy storage
Bill Vinyard
wcvinyard at earthlink.net
Sat May 15 02:56:20 CEST 2004
A rough estimate ... it looks like you're trying to store about 38 million
numbers in the result. Do you need all of the models at the end, or are you
just trying to generate the output and look at it later?
Perhaps you could save intermediate results to file; that is, create a
separate file for each gene model, or one file after each set of n gene
models.
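Something along these lines might work. This is only an untested sketch:
fit.in.chunks, chunk.size and the file names are made up for illustration,
and probe.fit is your routine below.

## untested sketch: fit the models in chunks of genes and save each
## chunk to its own file instead of keeping everything in memory
fit.in.chunks <- function(emat, factors, model, chunk.size = 500,
                          out.dir = "fits")
{
    dir.create(out.dir, showWarnings = FALSE)
    genes <- unique(rownames(emat))
    chunks <- split(genes, ceiling(seq_along(genes) / chunk.size))
    for (i in seq_along(chunks)) {
        keep <- rownames(emat) %in% chunks[[i]]
        res <- probe.fit(emat[keep, , drop = FALSE], factors, model)
        save(res, file = file.path(out.dir,
                                   paste("chunk", i, ".RData", sep = "")))
        rm(res)
        gc()  # free the finished chunk before starting the next one
    }
}

Afterwards you can load() and combine only the pieces you actually want to
look at.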
-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of
Arne.Muller at aventis.com
Sent: Friday, May 14, 2004 19:45
To: r-help at stat.math.ethz.ch
Subject: [R] help with memory greedy storage
Hello,
I have a problem with a self-written routine that takes a lot of memory
(>1.2 GB). Maybe you can suggest some enhancements; I'm pretty sure that my
implementation is not optimal ...
I'm creating many linear models and storing the coefficients, anova
p-values, and everything else I need in separate lists, which are finally
returned together in a list (a list of lists).
The input is a matrix with 84 columns and >100,000 rows. The routine probeDf
below creates a data frame that assigns the 84 samples (columns) to the
different factors, not just for a single row of the matrix but for all of
the rows returned by emat[which(rows == g),], and adds a new factor
('probe'). This results in a 1344 by 6 data frame.
Example data frame returned by probeDf:
Value batch time dose array probe
1 2.317804 NEW 24h 000mM 1 1
2 2.495390 NEW 24h 000mM 2 1
3 2.412247 NEW 24h 000mM 3 1
...
144 8.851469 OLD 04h 100mM 60 2
145 8.801430 PRG 24h 000mM 61 2
146 8.308224 PRG 24h 000mM 62 2
...
This data frame is not the problem, since it gets generated on the fly per
gene and is discarded afterwards (it just takes some time to generate).
Here comes the problematic routine:
### emat: matrix (probes in rows, 84 samples in columns)
### factors: data frame of per-sample factors
### model: formula for lm, contr: optional contrasts
probe.fit <- function(emat, factors, model, contr=NULL)
{
    rows <- rownames(emat)
    genes <- unique(rows)
    l <- length(genes)
    ### generate proper labels (names) for the anova p-values
    difflabels <- attr(terms(model), "term.labels")
    aov <- list()    # anova p-values for factors + interactions
    coef <- list()   # lm coefficients
    coefp <- list()  # p-values for coefficients
    rsq <- list()    # R-squared of fit
    fitted <- list() # fitted values
    value <- list()  # orig. values (used with fitted to get residuals)
    for ( g in genes ) { # loop over >12,000 genes
        ### g is the name that identifies 14 to 16 rows in emat
        ### d is the data frame for the lm
        d <- probeDf(emat[which(rows == g),], factors)
        fit <- lm(model, data = d, contrasts = contr)
        fit.sum <- summary(fit)
        aov[[g]] <- as.vector(na.omit(anova(fit)$'Pr(>F)'))
        names(aov[[g]]) <- difflabels
        coef[[g]] <- coef(fit)[-1]
        coefp[[g]] <- coef(fit.sum)[-1, 'Pr(>|t|)']
        rsq[[g]] <- fit.sum$'r.squared'
        value[[g]] <- d$Value
        fitted[[g]] <- fitted(fit)
    }
    list(aov=aov, coefs=coef, coefp=coefp, rsq=rsq,
         fitted=fitted, values=value)
}
### create a data frame from a matrix (usually 16 rows and 84 columns)
### and a list of factors. Basically this repeats the factors 16 times
### (once for each row in the matrix). This results in a data frame with
### 84*16 rows and as many columns as there are factors + 2 (the probe
### factor + the value to be modelled later)
probeDf <- function(emat, facts) {
    df <- NULL
    n <- 1
    nsamp <- ncol(emat)
    for ( i in 1:nrow(emat) ) {
        values <- c(t(emat[i,]))
        df.new <- data.frame(Value = values, facts, probe = rep(n, nsamp))
        n <- n + 1
        if ( !is.null(df) ) {
            df <- rbind(df, df.new)
        } else {
            df <- df.new
        }
    }
    df$probe <- as.factor(df$probe)
    df
}
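(I suppose the same data frame could also be built in one go instead of
growing it with rbind inside the loop; a rough, untested sketch, assuming
facts is a data frame with one row per column (sample) of emat:)

probeDf2 <- function(emat, facts)
{
    nsamp <- ncol(emat)
    nprobe <- nrow(emat)
    ## stack the probe rows of emat on top of each other, repeating the
    ## per-sample factors once per probe and adding the probe index
    df <- data.frame(Value = as.vector(t(emat)),
                     facts[rep(seq_len(nsamp), times = nprobe), ,
                           drop = FALSE],
                     probe = factor(rep(seq_len(nprobe), each = nsamp)))
    rownames(df) <- NULL
    df
}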
If I remove coef, coefp, value and fitted from the loop in probe.fit, the
memory usage is moderate. The problem is that each of the 12,000 genes
contributes 148 coefficients (the model contains quite a few factors) plus
their p-values, and the fitted and value vectors are each >1300 elements
long. I couldn't find a more compact form of storage that is still easy to
explore afterwards.
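One layout I have been wondering about is a preallocated matrix per
quantity, with one row per gene: 12,000 x 148 doubles for the coefficients
is only about 14 MB, and the fitted/value vectors would become 12,000 x
1344 matrices of roughly 130 MB each. A rough, untested sketch (it assumes
every model yields the same 148 coefficients, and reuses genes, rows, emat,
model, factors and contr from probe.fit above):

## rough sketch: one preallocated matrix per quantity, one row per gene
ngenes <- length(genes)
ncoef  <- 148                        # coefficients per model
coefs  <- matrix(NA, nrow = ngenes, ncol = ncoef,
                 dimnames = list(genes, NULL))
coefps <- matrix(NA, nrow = ngenes, ncol = ncoef,
                 dimnames = list(genes, NULL))
rsqs   <- numeric(ngenes)
names(rsqs) <- genes
for (i in seq_along(genes)) {
    d       <- probeDf(emat[which(rows == genes[i]), ], factors)
    fit     <- lm(model, data = d, contrasts = contr)
    fit.sum <- summary(fit)
    coefs[i, ]  <- coef(fit)[-1]
    coefps[i, ] <- coef(fit.sum)[-1, 'Pr(>|t|)']
    rsqs[i]     <- fit.sum$r.squared
}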
Suggestions on how to get this done more efficiently (in terms of memory)
are gratefully received.
kind regards,
Arne
--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html