[R] Multiple imputation, especially in rms/Hmisc packages

Wed Aug 11 02:16:37 CEST 2010

Hi Frank (and others),

Thank you for your comments in reply to my questions. I had not 
encountered contrast tests before. I've looked in a few texts, but the 
only place I could find anything about contrast tests was your 
Regression Modeling Strategies book.

You wrote that the "leave some variables out" F-test and the contrast 
tests are equivalent in ordinary regression, so I have tried to verify 
this. However, I got different results for the two tests, so I'm hoping 
you can point out what I'm doing incorrectly.

The null hypothesis is CB = 0, where C is the contrast matrix and B is 
the vector of regression coefficients. I'm using the test statistic on 
p. 189 of RMS:
W = (Cb)'(CVC')^{-1}(Cb), with b being the vector of coefficient 
estimates and V being its covariance matrix.

The model is y ~ rcs(x1,3) + x2, and I want to test the null hypothesis 
that the two x1 coefficients are both zero. This is equivalent to CB = 
0, with C being the matrix
[0 1 0 0 (first row); 0 0 1 0 (second row)].

library(rms)
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)
y <- 1 + x1 + x2 + rnorm(50,0,3)

ols.full <- ols(y ~ rcs(x1,3) + x2) # full model
ols.red <- ols(y ~ x2) # reduced model

RSS.full <- sum((ols.full$fitted - y)^2) # residual sums of squares
RSS.red <- sum((ols.red$fitted - y)^2)

F <- ((RSS.red - RSS.full)/(48 - 46))/(RSS.full/46)  # test statistic 
for F-test
F
[1] 2.576335

1 - pf(F, 2, 46)
[1] 0.08698798

anova(ols.full)
                Analysis of Variance          Response: y

 Factor     d.f. Partial SS   MS          F    P    
 x1          2    39.95283222 19.97641611 2.58 0.0870
  Nonlinear  1     0.07360774  0.07360774 0.01 0.9228
 x2          1    47.17322335 47.17322335 6.08 0.0174
 REGRESSION  3    83.79617922 27.93205974 3.60 0.0203
 ERROR      46   356.67539200  7.75381287           

anova(ols.full)["x1", "F"]
[1] 2.576335    # This confirms that the result of anova is the same as 
that of the "leave some variables out" F-test.

V <- ols.full$var
C <- matrix(c(0,1,0,0, 0,0,1,0), nrow=2, byrow=TRUE)
b <- ols.full$coef
W <- t(C %*% b) %*% solve(C %*% V %*% t(C)) %*% (C %*% b)

1 - pchisq(W, 2)
           [,1]
[1,] 0.07605226

Have I used the correct distribution for W? Should I expect the p-values 
from the two tests to be exactly the same?

Thanks in advance; I really appreciate any help you can give.

Mark

Frank Harrell wrote:
>
> On Mon, 9 Aug 2010, Mark Seeto wrote:
>
>> Hello, I have a general question about combining imputations as well 
>> as a
>> question specific to the rms and Hmisc packages.
>>
>> The situation is multiple regression on a data set where multiple
>> imputation has been used to give M imputed data sets. I know how to get
>> the combined estimate of the covariance matrix of the estimated
>> coefficients (average the M covariance matrices from the individual
>> imputations and add (M+1)/M times the between-imputation covariance
>> matrix), and I know how to use this to get p-values and confidence
>> intervals for the individual coefficients.
>>
>> What I don't know is how to combine imputations to obtain the multiple
>> degree-of-freedom tests to test whether a set of coefficients are all 0.
>> If there were no missing values, this would be done using the "general
>> linear test" which involves fitting the full model and the reduced
>> model and getting an F statistic based on the residual sums of 
>> squares and
>> degrees of freedom. How does one correctly do this when there are 
>> multiple
>> imputations? In the language of Hmisc, I'm asking how the values in
>> anova(fmi) are calculated if fmi is the result of fit.mult.impute (I've
>> looked at the anova.rms code but couldn't understand it).
>
> In ordinary regression, the "leave some variables out and compare 
> R^2"-based F-test and the quicker contrast tests are equivalent.   In 
> general we use the latter, which is expedient since it is easy to 
> estimate the imputation-corrected covariance matrix.  In the rms and 
> Hmisc packages, fit.mult.impute creates the needed fit object, then 
> anova, contrast, summary, etc. give the correct Wald tests.
>
>
>  >
>> The second question I have relates to rms/Hmisc specifically. When I use
>> fit.mult.impute, the fitted values don't appear to match the values 
>> given
>> by the predict function on the original data. Instead, they match the
>> fitted values of the last imputation. For example,
>>
>> library(rms)
>> set.seed(1)
>> x1 <- rnorm(40)
>> x2 <- 0.5*x1 + rnorm(40,0,3)
>>
>> y <- x1^2 + x2 + rnorm(40,0,3)
>> x2[40] <- NA   # create 1 missing value in x2
>> a <- aregImpute(~ y + x1 + x2, n.impute=2, nk=3) # 2 imputations
>>
>> x2.imp1 <- x2; x2.imp1[40] <- a$imputed$x2[,1]  # first imputed x2 
>> vector
>> x2.imp2 <- x2; x2.imp2[40] <- a$imputed$x2[,2]  # second imputed x2 
>> vector
>>
>> ols.imp1 <- ols(y ~ rcs(x1,3) + x2.imp1)  # model on imputation 1
>> ols.imp2 <- ols(y ~ rcs(x1,3) + x2.imp2)  # model on imputation 2
>>
>> d <- data.frame(y, x1, x2)
>> fmi <- fit.mult.impute(y ~ rcs(x1,3) + x2, ols, a, data=d)
>>
>> fmi$fitted
>> predict(fmi, d)
>>
>> I would have expected the results of fmi$fitted and predict(fmi,d) to be
>> the same except for the final NA, but they are not. Instead, 
>> fmi$fitted is
>> the same as ols.imp2$fitted. Also,
>>
>> anova(ols.imp2)
>> anova(fmi)
>>
>> unexpectedly give the same result in the ERROR line. I would be most
>> grateful if anyone could explain this to me.
>
> I need to add some more warnings.  For many of the calculations, the 
> last imputation is used.
>
> Frank
>
>>
>> Thanks,
>> Mark
>> -- 
>> Mark Seeto
>> Statistician
>>
>> National Acoustic Laboratories <http://www.nal.gov.au/>
>> A Division of Australian Hearing
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>