[R] Multiple imputation, especially in rms/Hmisc packages

Wed Aug 11 14:58:25 CEST 2010

Frank E Harrell Jr   Professor and Chairman        School of Medicine
                      Department of Biostatistics   Vanderbilt University

On Tue, 10 Aug 2010, Mark Seeto wrote:

> Hi Frank (and others),
>
> Thank you for your comments in reply to my questions. I had not
> encountered contrast tests before. I've looked in a few texts, but the
> only place I could find anything about contrast tests was your
> Regression Modeling Strategies book.
>
> You wrote that the "leave some variables out" F-test and the contrast
> tests are equivalent in ordinary regression, so I have tried to verify
> this. However, I got different results for the two tests, so I'm hoping
> you can point out what I'm doing incorrectly.
>
> The null hypothesis is CB = 0, where C is the contrast matrix and B is
> the vector of regression coefficients. I'm using the test statistic on
> p. 189 of RMS:
> W = (Cb)'(CVC')^{-1}(Cb), with b being the vector of coefficient
> estimates and V being its covariance matrix.
>
> The model is y ~ rcs(x1,3) + x2, and I want to test the null hypothesis
> that the two x1 coefficients are both zero. This is equivalent to CB =
> 0, with C being the matrix
> [0 1 0 0 (first row); 0 0 1 0 (second row)].
>
> library(rms)
> set.seed(1)
> x1 <- rnorm(50)
> x2 <- rnorm(50)
> y <- 1 + x1 + x2 + rnorm(50,0,3)
>
> ols.full <- ols(y ~ rcs(x1,3) + x2) # full model
> ols.red <- ols(y ~ x2) # reduced model
>
> RSS.full <- sum((ols.full$fitted - y)^2) # residual sums of squares
> RSS.red <- sum((ols.red$fitted - y)^2)
>
> F <- ((RSS.red - RSS.full)/(48 - 46))/(RSS.full/46)  # test statistic
> for F-test
> F
> [1] 2.576335
>
> 1 - pf(F, 2, 46)
> [1] 0.08698798
>
> anova(ols.full)
>                Analysis of Variance          Response: y
>
> Factor     d.f. Partial SS   MS          F    P
> x1          2    39.95283222 19.97641611 2.58 0.0870
>  Nonlinear  1     0.07360774  0.07360774 0.01 0.9228
> x2          1    47.17322335 47.17322335 6.08 0.0174
> REGRESSION  3    83.79617922 27.93205974 3.60 0.0203
> ERROR      46   356.67539200  7.75381287
>
> anova(ols.full)["x1", "F"]
> [1] 2.576335    # This confirms that the result of anova is the same as
> that of the "leave some variables out" F-test.
>
> V <- ols.full$var
> C <- matrix(c(0,1,0,0, 0,0,1,0), nrow=2, byrow=TRUE)
> b <- ols.full$coef
> W <- t(C %*% b) %*% solve(C %*% V %*% t(C)) %*% (C %*% b)
>
> 1 - pchisq(W, 2)
>           [,1]
> [1,] 0.07605226
>
> Have I used the correct distribution for W? Should I expect the p-values
> from the two tests to be exactly the same?

Mark - you can see the code for this at the bottom of anova.rms. 
Compute W, divide by numerator d.f., then compute P-value using F with 
numerator and error d.f.

Frank

  >
> Thanks in advance; I really appreciate any help you can give.
>
> Mark
>
> Frank Harrell wrote:
>>
>> On Mon, 9 Aug 2010, Mark Seeto wrote:
>>
>>> Hello, I have a general question about combining imputations as well
>>> as a
>>> question specific to the rms and Hmisc packages.
>>>
>>> The situation is multiple regression on a data set where multiple
>>> imputation has been used to give M imputed data sets. I know how to get
>>> the combined estimate of the covariance matrix of the estimated
>>> coefficients (average the M covariance matrices from the individual
>>> imputations and add (M+1)/M times the between-imputation covariance
>>> matrix), and I know how to use this to get p-values and confidence
>>> intervals for the individual coefficients.
>>>
>>> What I don't know is how to combine imputations to obtain the multiple
>>> degree-of-freedom tests to test whether a set of coefficients are all 0.
>>> If there were no missing values, this would be done using the "general
>>> linear test" which involves fitting the full model and the reduced
>>> model and getting an F statistic based on the residual sums of
>>> squares and
>>> degrees of freedom. How does one correctly do this when there are
>>> multiple
>>> imputations? In the language of Hmisc, I'm asking how the values in
>>> anova(fmi) are calculated if fmi is the result of fit.mult.impute (I've
>>> looked at the anova.rms code but couldn't understand it).
>>
>> In ordinary regression, the "leave some variables out and compare
>> R^2"-based F-test and the quicker contrast tests are equivalent.   In
>> general we use the latter, which is expedient since it is easy to
>> estimate the imputation-corrected covariance matrix.  In the rms and
>> Hmisc packages, fit.mult.impute creates the needed fit object, then
>> anova, contrast, summary, etc. give the correct Wald tests.
>>
>>
>> >
>>> The second question I have relates to rms/Hmisc specifically. When I use
>>> fit.mult.impute, the fitted values don't appear to match the values
>>> given
>>> by the predict function on the original data. Instead, they match the
>>> fitted values of the last imputation. For example,
>>>
>>> library(rms)
>>> set.seed(1)
>>> x1 <- rnorm(40)
>>> x2 <- 0.5*x1 + rnorm(40,0,3)
>>>
>>> y <- x1^2 + x2 + rnorm(40,0,3)
>>> x2[40] <- NA   # create 1 missing value in x2
>>> a <- aregImpute(~ y + x1 + x2, n.impute=2, nk=3) # 2 imputations
>>>
>>> x2.imp1 <- x2; x2.imp1[40] <- a$imputed$x2[,1]  # first imputed x2
>>> vector
>>> x2.imp2 <- x2; x2.imp2[40] <- a$imputed$x2[,2]  # second imputed x2
>>> vector
>>>
>>> ols.imp1 <- ols(y ~ rcs(x1,3) + x2.imp1)  # model on imputation 1
>>> ols.imp2 <- ols(y ~ rcs(x1,3) + x2.imp2)  # model on imputation 2
>>>
>>> d <- data.frame(y, x1, x2)
>>> fmi <- fit.mult.impute(y ~ rcs(x1,3) + x2, ols, a, data=d)
>>>
>>> fmi$fitted
>>> predict(fmi, d)
>>>
>>> I would have expected the results of fmi$fitted and predict(fmi,d) to be
>>> the same except for the final NA, but they are not. Instead,
>>> fmi$fitted is
>>> the same as ols.imp2$fitted. Also,
>>>
>>> anova(ols.imp2)
>>> anova(fmi)
>>>
>>> unexpectedly give the same result in the ERROR line. I would be most
>>> grateful if anyone could explain this to me.
>>
>> I need to add some more warnings.  For many of the calculations, the
>> last imputation is used.
>>
>> Frank
>>
>>>
>>> Thanks,
>>> Mark
>>> --
>>> Mark Seeto
>>> Statistician
>>>
>>> National Acoustic Laboratories <http://www.nal.gov.au/>
>>> A Division of Australian Hearing
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>