[R] Multiple imputation, especially in rms/Hmisc packages
Mark Seeto
Mark.Seeto at nal.gov.au
Wed Aug 11 02:16:37 CEST 2010
Hi Frank (and others),
Thank you for your comments in reply to my questions. I had not
encountered contrast tests before. I've looked in a few texts, but the
only place I could find anything about contrast tests was your
Regression Modeling Strategies book.
You wrote that the "leave some variables out" F-test and the contrast
tests are equivalent in ordinary regression, so I have tried to verify
this. However, I got different results for the two tests, so I'm hoping
you can point out what I'm doing incorrectly.
The null hypothesis is CB = 0, where C is the contrast matrix and B is
the vector of regression coefficients. I'm using the test statistic on
p. 189 of RMS:
W = (Cb)'(CVC')^{-1}(Cb), with b being the vector of coefficient
estimates and V being its covariance matrix.
The model is y ~ rcs(x1,3) + x2, and I want to test the null hypothesis
that the two x1 coefficients are both zero. This is equivalent to CB =
0, with C being the matrix
[0 1 0 0 (first row); 0 0 1 0 (second row)].
library(rms)
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)
y <- 1 + x1 + x2 + rnorm(50,0,3)
ols.full <- ols(y ~ rcs(x1,3) + x2) # full model
ols.red <- ols(y ~ x2) # reduced model
RSS.full <- sum((ols.full$fitted - y)^2) # residual sums of squares
RSS.red <- sum((ols.red$fitted - y)^2)
F <- ((RSS.red - RSS.full)/(48 - 46))/(RSS.full/46) # test statistic
for F-test
F
[1] 2.576335
1 - pf(F, 2, 46)
[1] 0.08698798
anova(ols.full)
Analysis of Variance Response: y
Factor d.f. Partial SS MS F P
x1 2 39.95283222 19.97641611 2.58 0.0870
Nonlinear 1 0.07360774 0.07360774 0.01 0.9228
x2 1 47.17322335 47.17322335 6.08 0.0174
REGRESSION 3 83.79617922 27.93205974 3.60 0.0203
ERROR 46 356.67539200 7.75381287
anova(ols.full)["x1", "F"]
[1] 2.576335 # This confirms that the result of anova is the same as
that of the "leave some variables out" F-test.
V <- ols.full$var
C <- matrix(c(0,1,0,0, 0,0,1,0), nrow=2, byrow=TRUE)
b <- ols.full$coef
W <- t(C %*% b) %*% solve(C %*% V %*% t(C)) %*% (C %*% b)
1 - pchisq(W, 2)
[,1]
[1,] 0.07605226
Have I used the correct distribution for W? Should I expect the p-values
from the two tests to be exactly the same?
Thanks in advance; I really appreciate any help you can give.
Mark
Frank Harrell wrote:
>
> On Mon, 9 Aug 2010, Mark Seeto wrote:
>
>> Hello, I have a general question about combining imputations as well
>> as a
>> question specific to the rms and Hmisc packages.
>>
>> The situation is multiple regression on a data set where multiple
>> imputation has been used to give M imputed data sets. I know how to get
>> the combined estimate of the covariance matrix of the estimated
>> coefficients (average the M covariance matrices from the individual
>> imputations and add (M+1)/M times the between-imputation covariance
>> matrix), and I know how to use this to get p-values and confidence
>> intervals for the individual coefficients.
>>
>> What I don't know is how to combine imputations to obtain the multiple
>> degree-of-freedom tests to test whether a set of coefficients are all 0.
>> If there were no missing values, this would be done using the "general
>> linear test" which involves fitting the full model and the reduced
>> model and getting an F statistic based on the residual sums of
>> squares and
>> degrees of freedom. How does one correctly do this when there are
>> multiple
>> imputations? In the language of Hmisc, I'm asking how the values in
>> anova(fmi) are calculated if fmi is the result of fit.mult.impute (I've
>> looked at the anova.rms code but couldn't understand it).
>
> In ordinary regression, the "leave some variables out and compare
> R^2"-based F-test and the quicker contrast tests are equivalent. In
> general we use the latter, which is expedient since it is easy to
> estimate the imputation-corrected covariance matrix. In the rms and
> Hmisc packages, fit.mult.impute creates the needed fit object, then
> anova, contrast, summary, etc. give the correct Wald tests.
>
>
> >
>> The second question I have relates to rms/Hmisc specifically. When I use
>> fit.mult.impute, the fitted values don't appear to match the values
>> given
>> by the predict function on the original data. Instead, they match the
>> fitted values of the last imputation. For example,
>>
>> library(rms)
>> set.seed(1)
>> x1 <- rnorm(40)
>> x2 <- 0.5*x1 + rnorm(40,0,3)
>>
>> y <- x1^2 + x2 + rnorm(40,0,3)
>> x2[40] <- NA # create 1 missing value in x2
>> a <- aregImpute(~ y + x1 + x2, n.impute=2, nk=3) # 2 imputations
>>
>> x2.imp1 <- x2; x2.imp1[40] <- a$imputed$x2[,1] # first imputed x2
>> vector
>> x2.imp2 <- x2; x2.imp2[40] <- a$imputed$x2[,2] # second imputed x2
>> vector
>>
>> ols.imp1 <- ols(y ~ rcs(x1,3) + x2.imp1) # model on imputation 1
>> ols.imp2 <- ols(y ~ rcs(x1,3) + x2.imp2) # model on imputation 2
>>
>> d <- data.frame(y, x1, x2)
>> fmi <- fit.mult.impute(y ~ rcs(x1,3) + x2, ols, a, data=d)
>>
>> fmi$fitted
>> predict(fmi, d)
>>
>> I would have expected the results of fmi$fitted and predict(fmi,d) to be
>> the same except for the final NA, but they are not. Instead,
>> fmi$fitted is
>> the same as ols.imp2$fitted. Also,
>>
>> anova(ols.imp2)
>> anova(fmi)
>>
>> unexpectedly give the same result in the ERROR line. I would be most
>> grateful if anyone could explain this to me.
>
> I need to add some more warnings. For many of the calculations, the
> last imputation is used.
>
> Frank
>
>>
>> Thanks,
>> Mark
>> --
>> Mark Seeto
>> Statistician
>>
>> National Acoustic Laboratories <http://www.nal.gov.au/>
>> A Division of Australian Hearing
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
More information about the R-help
mailing list