[R-sig-ME] Formula df when combining imputed data

Thu Jun 1 18:20:47 CEST 2017

Dear list,

In a given dataset, I have 10 dichotomous variables whose missing values 
were filled in by multiple imputation. For each variable, I estimated a 
glmmPQL model with random intercepts for countries and for schools 
within countries. For one of the 10 variables, the syntax is:

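library(MASS)  # provides glmmPQL

# Random intercepts for countries and for schools within countries
themodel <- glmmPQL(yvariable ~ 1 + Gender + AGE + migrant + rep +
                      missing_rep + Schoolsize + Schoolmaterials + GDP,
                    random = list(country = ~ 1, CNTSCHID = ~ 1),
                    family = binomial, data = pisas)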

The results show:

                     Value Std.Error    DF    t-value p-value
(Intercept)     -6.221943 0.9882684 15501  -6.295802  0.0000
Gender          -0.493744 0.0363663 15501 -13.576956  0.0000
AGE              0.177124 0.0612163 15501   2.893407  0.0038
migrant         -0.311810 0.0867452 15501  -3.594553  0.0003
rep             -2.510684 0.2342525 15501 -10.717855  0.0000
missing_rep     -1.986272 0.3907376 15501  -5.083390  0.0000
Schoolsize       0.000536 0.0000633  7699   8.470045  0.0000
Schoolmaterials  0.334300 0.0525469  7699   6.361938  0.0000
GDP              0.000013 0.0000081    31   1.598743  0.1200

I ran this model for each of the 10 imputed variables and then combined 
the results using the method proposed by Rubin, which is also explained 
by Carlin et al. in fmwww.bc.edu/RePEc/bocode/c/carlin.pdf. Equation (2) 
on page 4 shows how to calculate the number of df for the t-test of 
each regression coefficient. This is where I got stuck. Evaluating that 
formula gives the df below:

                DF given equation (2)
(Intercept)       19.91120
Gender            18.75981
AGE               19.63237
migrant           21.30057
rep               28.47710
missing_rep      133.05131
Schoolsize       122.45054
Schoolmaterials   74.71955
GDP             7231.16666
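
For reference, the pooling step can be sketched in R as follows. This is 
only a minimal sketch, assuming that equation (2) of Carlin et al. is the 
usual Rubin (1987) formula df = (m - 1) * (1 + W / ((1 + 1/m) * B))^2, 
with W the average within-imputation variance and B the 
between-imputation variance of a coefficient. The names pool_rubin, est 
and se are hypothetical, and the numbers are made up for illustration.

```r
# Pool one coefficient across m imputed-data fits using Rubin's rules.
# `est` and `se` are hypothetical vectors holding the m point estimates
# and their standard errors (one entry per imputed dataset).
pool_rubin <- function(est, se) {
  m <- length(est)
  stopifnot(m >= 2, length(se) == m)
  qbar <- mean(est)            # pooled point estimate
  W    <- mean(se^2)           # average within-imputation variance
  B    <- var(est)             # between-imputation variance
  Tvar <- W + (1 + 1/m) * B    # total variance
  # Rubin's (1987) df; this assumes the complete-data df are effectively
  # infinite, so the DF column of the glmmPQL output plays no role here.
  df   <- (m - 1) * (1 + W / ((1 + 1/m) * B))^2
  tval <- qbar / sqrt(Tvar)
  c(estimate = qbar, se = sqrt(Tvar), df = df,
    t = tval, p = 2 * pt(-abs(tval), df))
}

# Made-up numbers for illustration only (not taken from the PISA models)
est <- c(1.0, 1.2, 0.8, 1.1, 0.9)
se  <- rep(0.5, 5)
res <- pool_rubin(est, se)
print(res)
```

Under this assumption the df depend only on m and on the ratio W/B, not 
on the number of countries or schools, so a coefficient whose estimates 
barely vary across the imputations gets a very large df.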

As can be seen, the df in the glmmPQL output above are very different 
from those calculated with equation (2) of Carlin et al. in the Stata 
Journal. I realize that the df in the glmmPQL output cannot be entirely 
correct, because the missing values of yvariable were imputed and the 
variable was then analyzed as though it had no missing values at all. 
But I also wonder whether the df calculated with equation (2) are the 
"better" ones, because the differences are so large. E.g. for GDP, 
which is a country-level variable, I would expect a low number of df, 
as there are only 33 countries in the data. Could it be that equation 
(2) cannot be used for the df here? Or even worse: could it be that, 
for a logistic model with random country and school effects, the method 
proposed by Rubin for calculating the std. errors of the regression 
coefficients is not really applicable?

Thanks for any advice!!
Ben Pelzer.
