[BioC] limma with missing values

Gordon K Smyth smyth at wehi.EDU.AU
Sat Mar 8 03:13:37 CET 2008


Dear Hans-Ulrich,

The output that you give does not match your code.  The correct output is 
given below.

limma treats missing values in exactly the same way as other linear model 
function in R, such as lm(), glm() etc.  For each gene, the rows with 
missing values are removed both from the data and design matrix, then 
non-estimable coefficients are removed (given NA) from the end of the 
design matrix.  See the documentation for those functions for more 
information.

The ideal solution is not to introduce missing values into your data in 
the first place.  In my experimence, missing values are almost always 
avoidable.  I have never seen a situation where it was necessary or 
desirable to introduce a large proportion of missing values.

Best wishes
Gordon


> library(limma)
> df = data.frame(
+   g1=c(5,5,6,6,NA,NA),
+   g2=c(5,5,NA,NA,6,6),
+   g3=c(NA,NA,5,5,6,6),
+   treat=c("T1","T1","T2","T2","T3","T3"),
+   class=c("A","B","A","B","A","B"))
>   conts = list(treat="contr.sum", class="contr.treatment")
>   X = t(df[,1:3])
>   D = model.matrix( ~ class + treat, data=df, contrasts=conts)
>   Fit = lmFit(X, D)
> options(digits=3)
> Fit$coef
    (Intercept)   classB treat1 treat2
g1         6.0 3.15e-16   -1.0     NA
g2         5.5 3.15e-16   -0.5     NA
g3         5.0 2.31e-16   -1.0     NA
> conts = list(treat="contr.sum", class="contr.sum")
>   D = model.matrix( ~ class + treat, data=df, contrasts=conts)
>   Fit = lmFit(X, D)
> Fit$coef
    (Intercept)    class1 treat1 treat2
g1         6.0 -1.80e-16   -1.0     NA
g2         5.5 -1.80e-16   -0.5     NA
g3         5.0 -1.39e-16   -1.0     NA
> sessionInfo()
R version 2.6.2 (2008-02-08)
i386-pc-mingw32

locale:
LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;LC_MONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] limma_2.12.0



> Date: Thu, 06 Mar 2008 13:43:34 +0100
> From: Hans-Ulrich Klein <h.klein at uni-muenster.de>
> Subject: [BioC] limma with missing values
> To: bioconductor at stat.math.ethz.ch
>
> Dear all,
>
> I am struggling with using limma in the case of many missing values.
> Here is a small example:
>
> df = data.frame(
>  g1=c(5,5,6,6,NA,NA),
>  g2=c(5,5,NA,NA,6,6),
>  g3=c(NA,NA,5,5,6,6),
>  treat=c("T1","T1","T2","T2","T3","T3"),
>  class=c("A","B","A","B","A","B"))
>
>  conts = list(treat="contr.sum", class="contr.treatment")
>
>  X = t(df[,1:3])
>  D = model.matrix( ~ class + treat, data=df, contrasts=conts)
>  Fit = lmFit(X, D)
>
>
> These are the results:
>
> >   Fit$coefficients
>   (Intercept)       classB treatT2 treatT3
> g1           5 2.220446e-16       1      NA
> g2           5 2.220446e-16      NA       1
> g3           6 3.330669e-16      -1      NA
>
> Suppose I am interessted in the comparison of T2-T1. The column
> "treatT2" of "Fit$coefficients" looks fine for gene 1 and gene 2. In the
> case of gene 3 the baseline level T1 can not be estimated and the
> coefficient treatT2 seems to be the comparison of T2-T3. I find this
> confusing when computing many thousand models and applying toptable(Fit,
> coef="treatT2") afterwards.
>
> I changed the design matrix by setting
> conts = list(treat="contr.sum", class="contr.sum")
> Now, I get the results:
>
> >   Fit$coefficients
>   (Intercept)        class1 treat1 treat2
> g1         6.0 -1.665335e-16   -1.0     NA
> g2         5.5 -1.665335e-16   -0.5     NA
> g3         5.0 -1.110223e-16   -1.0     NA
>
>
> However, I expected:
>    Intercept    treat1   treat2
> g1   5.5              -0.5     0.5
> g2   5.5              -0.5     NA
> g3   5.5              NA      -0.5
>
>
> Is it appropriate to use limma in such scenarios? Did I something wrong
> or misinterpreted the results? I think, replacing the NAs by mean
> intensities would be a solution in this case.
>
> Regards,
> Hans-Ulrich
>
>
>
> > sessionInfo()
> R version 2.6.2 (2008-02-08)
> x86_64-pc-linux-gnu
>
> locale:
> C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] limma_2.12.0
>
> loaded via a namespace (and not attached):
> [1] rcompgen_0.1-17



More information about the Bioconductor mailing list