[BioC] Limma: NA handling in contrasts.fit
Gordon K Smyth
smyth at wehi.EDU.AU
Wed Apr 1 01:44:05 CEST 2009
Hi Simon,
I understand your point, that this behaviour is inconsistent in limma. It
is however a deliberate decision rather than a bug.
The reasoning is that a contrast is defined by c'b where c is the contrast
vector and b is the coef vector. In matrix arithmetic in R, the answer is
NA if any of the elements of c or b are NA, even if the NA is multiplied
by a zero in the contrast.
In R arithmetic, Inf is a possible value, so that NA*0 must indeed be NA,
because Inf*0 is possible and that is NA. You could argue that limma does
not allow fit coefficients to be Inf, and hence NA*0 should be zero in
contrasts.fit. I decided a long time ago that this might be nice in
contrasts.fit, but would involve too much special-case programming which
would be confusing and would slow down the function unsatifactorily.
You can work around it for your problem like this. If you want to compute
a contrast which involves only a subset of coefficients, simply subset to
the coefficients you want before forming the contrast. In your second
example,
> fit2 <- fit[,c(3,4)]
> fit2 <- contrasts.fit( fit2, c(1,-1) )
etc does what you want.
I suspect it is impossible to program an entirely consistent behaviour in
general when some coefs become NA. For one thing, the meaning of
remaining coefs is usually changed. So I prefer to allow knowledgeable
users to handle it as above. I am open to other suggestions. In my own
microarray analysis, I always make sure this situation doesn't arise.
Best wishes
Gordon
On Tue, 31 Mar 2009, Simon Anders wrote:
> Dear Gordon et al.
>
> I've just noticed an oddity in the way how limma's contrasts.fit function
> handles missing observations, and I wonder if it might be a bug.
>
> Please have a look at the following test case:
>
> I use the this design matrix:
>
> > dm <- cbind( intercept=1, a=rep(c(0,1),each=2), b=rep(c(0,1), 2) )
> > dm
> intercept a b
> [1,] 1 0 0
> [2,] 1 0 1
> [3,] 1 1 0
> [4,] 1 1 1
>
> Let's construct sample data for 5 genes and put in two NAs as follows:
>
> > y <- t( replicate( 5, dm %*% runif(3) ) )
> > y[ 1, 3:4 ] <- NA
> > y
> [,1] [,2] [,3] [,4]
> [1,] 0.099221925 0.5628846 NA NA
> [2,] 0.009771325 0.7748060 0.3977409 1.162776
> [3,] 0.223688182 0.6330630 0.6791238 1.088499
> [4,] 0.957762805 1.4338553 1.4259875 1.902080
> [5,] 0.766597103 1.3905022 0.9947635 1.618669
>
> If I fit this data with lmFit, I unsurprisingly get a warning that some
> coefficients cannot be estimated:
>
> > library( limma )
> > fit <- lmFit( y, dm )
> Warning message:
> Partial NA coefficients for 1 probe(s)
>
> The missing coefficient is the coefficient 'a' for the first gene. Note that
> 'b' can be estimated:
>
> > coefficients( fit )
> intercept a b
> [1,] 0.099221925 NA 0.4636627
> [2,] 0.009771325 0.3879696 0.7650347
> [3,] 0.223688182 0.4554356 0.4093748
> [4,] 0.957762805 0.4682247 0.4760925
> [5,] 0.766597103 0.2281664 0.6239051
>
> I now ask 'contrast.fit' to boil the fit object down to only contain the "b"
> coefficients. I should get the same coefficients, but only the "b" column.
>
> > fit2 <- contrasts.fit( fit, coefficients="b" )
> > coefficients(fit2)
> b
> [1,] NA
> [2,] 0.7650347
> [3,] 0.4093748
> [4,] 0.4760925
> [5,] 0.6239051
>
> Indeed, I get the same values, but the first value is now masked as 'NA'.
>
> Is there a reason for this, or is this a bug?
>
> Granted, in this example, the use of 'make.contrasts' is superfluous, but in
> the following example, it is not. I place the NA, such that 'a' cannot be
> estimated, and I get an NA in the contrast 'b-c':
>
> > dm <- cbind( intercept=1, a=rep(c(0,1),each=2), b=rep(c(0,1), 2) )
> > dm <- rbind( cbind( dm, c=0 ), cbind( dm, c=1 ) )
> > dm
> intercept a b c
> [1,] 1 0 0 0
> [2,] 1 0 1 0
> [3,] 1 1 0 0
> [4,] 1 1 1 0
> [5,] 1 0 0 1
> [6,] 1 0 1 1
> [7,] 1 1 0 1
> [8,] 1 1 1 1
> > y <- t( replicate( 5, dm %*% runif(4) ) )
> > y[ 1, c(3,4,7,8) ] <- NA
> > fit <- lmFit( y, dm )
> Warning message:
> Partial NA coefficients for 1 probe(s)
> > coefficients( fit )
> intercept a b c
> [1,] 0.17989906 NA 0.66812435 0.54889675
> [2,] 0.26932489 0.9461271 0.77956666 0.05345084
> [3,] 0.38124957 0.0309537 0.58205980 0.26414381
> [4,] 0.50592287 0.9680045 0.86566150 0.96073095
> [5,] 0.01829934 0.1103156 0.09401078 0.02140402
> > fit2 <- contrasts.fit( fit, c( 0, 0, 1, -1 ) )
> > coefficients(fit2)
> [,1]
> [1,] NA
> [2,] 0.72611582
> [3,] 0.31791599
> [4,] -0.09506945
> [5,] 0.07260676
>
>
> Thanks in advance for any help in getting around this, as I have a lot of
> data with quite some missing values.
>
> Best regards
> Simon Anders
>
> > sessionInfo()
> R version 2.9.0 alpha (2009-03-24 r48212)
> x86_64-unknown-linux-gnu
>
> locale:
> LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] limma_2.17.12
>
>
> +---
> | Dr. Simon Anders, Dipl. Phys.
> | European Bioinformatics Institute, Hinxton, Cambridgeshire, UK
> | office phone +44-1223-492680, mobile phone +44-7505-841692
> | preferred (permanent) e-mail: sanders at fs.tum.de
>
>
>
>
More information about the Bioconductor
mailing list