[R] Surprising results from summary(lm()) on data with NO variation
Warnes, Gregory R
gregory_r_warnes at groton.pfizer.com
Fri Dec 13 21:34:03 CET 2002
I have some data (from Affymetrix experiments) where I fit an aov() model to
a large number of outcome variables. A reasonable fraction of the outcome
variables have 0 variability because values below a cutoff have been
replaced with the cutoff value (often 20) .
In this case, the overall p-value from summary(lm(..)) is misleadingly
small:
For example (equivalent results in R 1.6.1 on both Solaris and for Windows
2000):
> x <- rep(20, 4+5+5) # no variation
> Treatment <- factor(rep( c("Control","Low","High"), c(4,5,5) )) #
3 treatment levels
> summary( lm( x ~ Treatment) )
Call:
lm(formula = x ~ Treatment)
Residuals:
Min 1Q Median 3Q Max
-1.994e-14 7.889e-31 7.889e-31 7.889e-31 6.647e-15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.000e+01 3.471e-15 5.762e+15 <2e-16 ***
TreatmentHigh 6.647e-15 4.657e-15 1.427e+00 0.181
TreatmentLow 6.647e-15 4.657e-15 1.427e+00 0.181
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 6.942e-15 on 11 degrees of freedom
Multiple R-Squared: 0.5333, Adjusted R-squared: 0.4485
F-statistic: 6.286 on 2 and 11 DF, p-value: 0.01512
^^^^^^^^^^^^^^^^
Note F statistic and the overall p-value.
For some values (eg 0 and 7), fitting this model will generate NA's for the
estimates. For most values, we get similar results. I presume that this is
due to numerical problems in the fitting algorithm. Still the result is
quite surprising.
It appears that summary.aov doesn't have the same problem:
> x <- rep(20, 4+5+5) # no variation
> Treatment <- factor(rep( c("Control","Low","High"), c(4,5,5) )) #
3 treatment levels
> summary( aov( x ~ Treatment) )
Df Sum Sq Mean Sq F value Pr(>F)
Treatment 2 1.2622e-28 6.3110e-29 1.3095 0.3089
Residuals 11 5.3011e-28 4.8190e-29
Perhaps the mechanism for computing the F statistic that summary.aov uses is
less susceptible to the 'no variability' problem than the computation in
summary.lm.
I'm posting this here to raise a warning flag. For my particular
application, I'll just check for the no variability case and handle it
separately. I'll leave it to the R core folks to decide if this is a big
enough problem to do something about in the code.
-Greg
LEGAL NOTICE\ Unless expressly stated otherwise, this message is ... [[dropped]]
More information about the R-help
mailing list