[R-sig-ME] Formula df when combining imputed data
Ben Pelzer
b.pelzer at maw.ru.nl
Thu Jun 1 18:20:47 CEST 2017
Dear list,
In a given dataset, I have 10 dichotomous variables whose missing values
were filled in by multiple imputation. For each variable, a glmmPQL model
was estimated, with random intercepts for countries and for schools
within countries. For one of the 10 variables the syntax is:
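library(MASS)  # provides glmmPQL, which fits the model via nlme

themodel <- glmmPQL(
  yvariable ~ 1 + Gender + AGE + migrant + rep + missing_rep +
    Schoolsize + Schoolmaterials + GDP,
  random = list(country = ~ 1, CNTSCHID = ~ 1),  # schools within countries
  family = binomial, data = pisas)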
The results show:
                     Value  Std.Error     DF     t-value  p-value
(Intercept)      -6.221943  0.9882684  15501   -6.295802   0.0000
Gender           -0.493744  0.0363663  15501  -13.576956   0.0000
AGE               0.177124  0.0612163  15501    2.893407   0.0038
migrant          -0.311810  0.0867452  15501   -3.594553   0.0003
rep              -2.510684  0.2342525  15501  -10.717855   0.0000
missing_rep      -1.986272  0.3907376  15501   -5.083390   0.0000
Schoolsize        0.000536  0.0000633   7699    8.470045   0.0000
Schoolmaterials   0.334300  0.0525469   7699    6.361938   0.0000
GDP               0.000013  0.0000081     31    1.598743   0.1200
I ran this model for each of the 10 imputed variables and then combined
the results using the method proposed by Rubin, which is also explained
by Carlin et al. in fmwww.bc.edu/RePEc/bocode/c/carlin.pdf. Equation (2)
on page 4 shows how to calculate the number of degrees of freedom (df)
for the t-test of each regression coefficient. This is where I got
stuck. Evaluating the formula gives the following df:
                 DF given equation (2)
(Intercept)                   19.91120
Gender                        18.75981
AGE                           19.63237
migrant                       21.30057
rep                           28.47710
missing_rep                  133.05131
Schoolsize                   122.45054
Schoolmaterials               74.71955
GDP                         7231.16666
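
For reference, the pooling step can be sketched in R as below. This is only a sketch, and it assumes that equation (2) is the classical Rubin (1987) formula, df = (m - 1) * (1 + 1/r)^2, where m is the number of imputations and r = (1 + 1/m) * B / W is the relative increase in variance due to missingness (B = between-imputation variance, W = average within-imputation variance). The function name pool_rubin and its arguments est and se are made up for illustration and are not part of any package.

```r
# Pool one regression coefficient across m imputed analyses
# using Rubin's rules (classical large-sample df formula).
#   est: numeric vector of the m point estimates
#   se:  numeric vector of the m standard errors
pool_rubin <- function(est, se) {
  m <- length(est)
  stopifnot(m > 1, length(se) == m)

  qbar <- mean(est)             # pooled point estimate
  W    <- mean(se^2)            # average within-imputation variance
  B    <- var(est)              # between-imputation variance
  Tvar <- W + (1 + 1/m) * B     # total variance of the pooled estimate
  r    <- (1 + 1/m) * B / W     # relative increase in variance
  df   <- (m - 1) * (1 + 1/r)^2 # assumed form of equation (2)

  c(estimate = qbar, se = sqrt(Tvar), df = df)
}

# Toy input (invented numbers, not taken from the PISA data)
res <- pool_rubin(est = c(0.10, 0.12, 0.08, 0.11, 0.09),
                  se  = rep(0.05, 5))
print(res)
```

In this form the df depend only on m and on the ratio of between- to within-imputation variance. The df of the complete-data analysis (e.g. the 31 reported for GDP) do not enter the calculation at all, so a coefficient whose estimates barely differ across imputations gets a very large df.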
As you can see, the df in the glmmPQL output above are very different
from those calculated with equation (2) of Carlin et al. in the Stata
Journal. I realize that the ones in the glmmPQL output cannot be entirely
correct, because the missing values of yvariable were imputed and the
variable was then analyzed as though it had no missing values at all. But
I also wonder whether the df calculated with equation (2) are the
"better" ones, because the differences are so large. E.g. for GDP, which
is a country-level variable, I would expect a low number of df, as there
are only 33 countries in the data. Could it be that the formula of
equation (2) cannot be used here? Or even worse: could it be that, for a
logistic model with random country and school effects, the method
proposed by Rubin for calculating the standard errors of the regression
coefficients is not really applicable?
Thanks for any advice!!
Ben Pelzer.