[BioC] limma with missing values
Gordon K Smyth
smyth at wehi.EDU.AU
Sat Mar 8 03:13:37 CET 2008
Dear Hans-Ulrich,
The output that you give does not match your code. The correct output is
given below.
limma treats missing values in exactly the same way as other linear model
function in R, such as lm(), glm() etc. For each gene, the rows with
missing values are removed both from the data and design matrix, then
non-estimable coefficients are removed (given NA) from the end of the
design matrix. See the documentation for those functions for more
information.
The ideal solution is not to introduce missing values into your data in
the first place. In my experimence, missing values are almost always
avoidable. I have never seen a situation where it was necessary or
desirable to introduce a large proportion of missing values.
Best wishes
Gordon
> library(limma)
> df = data.frame(
+ g1=c(5,5,6,6,NA,NA),
+ g2=c(5,5,NA,NA,6,6),
+ g3=c(NA,NA,5,5,6,6),
+ treat=c("T1","T1","T2","T2","T3","T3"),
+ class=c("A","B","A","B","A","B"))
> conts = list(treat="contr.sum", class="contr.treatment")
> X = t(df[,1:3])
> D = model.matrix( ~ class + treat, data=df, contrasts=conts)
> Fit = lmFit(X, D)
> options(digits=3)
> Fit$coef
(Intercept) classB treat1 treat2
g1 6.0 3.15e-16 -1.0 NA
g2 5.5 3.15e-16 -0.5 NA
g3 5.0 2.31e-16 -1.0 NA
> conts = list(treat="contr.sum", class="contr.sum")
> D = model.matrix( ~ class + treat, data=df, contrasts=conts)
> Fit = lmFit(X, D)
> Fit$coef
(Intercept) class1 treat1 treat2
g1 6.0 -1.80e-16 -1.0 NA
g2 5.5 -1.80e-16 -0.5 NA
g3 5.0 -1.39e-16 -1.0 NA
> sessionInfo()
R version 2.6.2 (2008-02-08)
i386-pc-mingw32
locale:
LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;LC_MONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] limma_2.12.0
> Date: Thu, 06 Mar 2008 13:43:34 +0100
> From: Hans-Ulrich Klein <h.klein at uni-muenster.de>
> Subject: [BioC] limma with missing values
> To: bioconductor at stat.math.ethz.ch
>
> Dear all,
>
> I am struggling with using limma in the case of many missing values.
> Here is a small example:
>
> df = data.frame(
> g1=c(5,5,6,6,NA,NA),
> g2=c(5,5,NA,NA,6,6),
> g3=c(NA,NA,5,5,6,6),
> treat=c("T1","T1","T2","T2","T3","T3"),
> class=c("A","B","A","B","A","B"))
>
> conts = list(treat="contr.sum", class="contr.treatment")
>
> X = t(df[,1:3])
> D = model.matrix( ~ class + treat, data=df, contrasts=conts)
> Fit = lmFit(X, D)
>
>
> These are the results:
>
> > Fit$coefficients
> (Intercept) classB treatT2 treatT3
> g1 5 2.220446e-16 1 NA
> g2 5 2.220446e-16 NA 1
> g3 6 3.330669e-16 -1 NA
>
> Suppose I am interessted in the comparison of T2-T1. The column
> "treatT2" of "Fit$coefficients" looks fine for gene 1 and gene 2. In the
> case of gene 3 the baseline level T1 can not be estimated and the
> coefficient treatT2 seems to be the comparison of T2-T3. I find this
> confusing when computing many thousand models and applying toptable(Fit,
> coef="treatT2") afterwards.
>
> I changed the design matrix by setting
> conts = list(treat="contr.sum", class="contr.sum")
> Now, I get the results:
>
> > Fit$coefficients
> (Intercept) class1 treat1 treat2
> g1 6.0 -1.665335e-16 -1.0 NA
> g2 5.5 -1.665335e-16 -0.5 NA
> g3 5.0 -1.110223e-16 -1.0 NA
>
>
> However, I expected:
> Intercept treat1 treat2
> g1 5.5 -0.5 0.5
> g2 5.5 -0.5 NA
> g3 5.5 NA -0.5
>
>
> Is it appropriate to use limma in such scenarios? Did I something wrong
> or misinterpreted the results? I think, replacing the NAs by mean
> intensities would be a solution in this case.
>
> Regards,
> Hans-Ulrich
>
>
>
> > sessionInfo()
> R version 2.6.2 (2008-02-08)
> x86_64-pc-linux-gnu
>
> locale:
> C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] limma_2.12.0
>
> loaded via a namespace (and not attached):
> [1] rcompgen_0.1-17
More information about the Bioconductor
mailing list