[R-sig-ME] Formula df when combining imputed data

Thu Jun 1 18:24:16 CEST 2017

Sorry list, but I forgot to mention that the 10 variables were actually 
10 imputed versions of the same underlying variable, which has missing 
values. Therefore, Rubin's method of combining regression results comes 
into the picture.

Ben Pelzer.

On 1-6-2017 18:20, Ben Pelzer wrote:
> Dear list,
>
> In a given dataset, I have 10 dichotomous variables, the missing values
> of which were substituted by multiple imputation techniques. For each
> variable, a glmmPQL model was estimated. The model has a random
> intercept across countries and schools-within-countries. For one of the
> 10 variables the syntax is:
>
> library(MASS)  # provides glmmPQL
> themodel <- glmmPQL(yvariable ~ 1 + Gender + AGE + migrant + rep +
>                       missing_rep + Schoolsize + Schoolmaterials + GDP,
>                     random = list(country = ~ 1, CNTSCHID = ~ 1),
>                     family = binomial, data = pisas)
>
> The results show:
>
>                       Value Std.Error    DF    t-value p-value
> (Intercept)     -6.221943 0.9882684 15501  -6.295802  0.0000
> Gender          -0.493744 0.0363663 15501 -13.576956  0.0000
> AGE              0.177124 0.0612163 15501   2.893407  0.0038
> migrant         -0.311810 0.0867452 15501  -3.594553  0.0003
> rep             -2.510684 0.2342525 15501 -10.717855  0.0000
> missing_rep     -1.986272 0.3907376 15501  -5.083390  0.0000
> Schoolsize       0.000536 0.0000633  7699   8.470045  0.0000
> Schoolmaterials  0.334300 0.0525469  7699   6.361938  0.0000
> GDP              0.000013 0.0000081    31   1.598743  0.1200
>
>
> I ran this model for each of the 10 imputed variables and then combined
> the results using the method proposed by Rubin, which is also explained
> by Carlin et al. in fmwww.bc.edu/RePEc/bocode/c/carlin.pdf. Equation (2)
> on page 4 shows how to calculate the number of df for the t-test of each
> regression coefficient. This is where I got stuck. Evaluating the
> formula gives the following df:
>
>                 DF given equation (2)
>
> (Intercept)       19.91120
> Gender            18.75981
> AGE               19.63237
> migrant           21.30057
> rep               28.47710
> missing_rep      133.05131
> Schoolsize       122.45054
> Schoolmaterials   74.71955
> GDP             7231.16666
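
For reference, the pooling step described above can be sketched in base R. This is a minimal sketch, assuming equation (2) of Carlin et al. is Rubin's classic degrees-of-freedom formula, df = (m - 1) * (1 + W / ((1 + 1/m) * B))^2, with W the within-imputation and B the between-imputation variance. The function name `pool_rubin` and the input values are hypothetical and not taken from the analysis in this thread.

```r
# Pool one coefficient across m imputed-data fits using Rubin's rules.
# est: the m point estimates; se: the m standard errors (one per fit).
pool_rubin <- function(est, se) {
  m    <- length(est)
  qbar <- mean(est)                 # pooled point estimate
  W    <- mean(se^2)                # within-imputation variance
  B    <- var(est)                  # between-imputation variance
  Tvar <- W + (1 + 1 / m) * B       # total variance
  # Rubin's classic df (assumed to match equation (2) of Carlin et al.)
  df   <- (m - 1) * (1 + W / ((1 + 1 / m) * B))^2
  c(estimate = qbar, se = sqrt(Tvar), df = df)
}

# Hypothetical estimates and standard errors from 10 imputed-data fits
est <- c(0.31, 0.33, 0.35, 0.32, 0.34, 0.30, 0.36, 0.33, 0.32, 0.34)
se  <- rep(0.05, 10)
print(pool_rubin(est, se))
```

Note that this df depends only on m and on the ratio of W to B. The number of countries or schools does not enter at all, so a coefficient whose estimates barely vary across imputations gets a very large df.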
>
>
> As can be seen, the df in the glmmPQL results above are very different
> from those calculated with equation (2) of Carlin et al. in the Stata
> Journal. I realize that the df in the glmmPQL results cannot be entirely
> correct, because the missing values of yvariable were imputed and the
> variable was then analyzed as though it had no missing values at all.
> But I also wonder whether the df calculated with equation (2) are the
> "better" ones, because the differences are so large. E.g. for GDP,
> which is a country-level variable, I would expect a low number of df, as
> there are only 33 countries in the data. Could it be that the formula of
> equation (2) for the df cannot be used here? Or even worse: is the
> method proposed by Rubin for calculating the std. errors of the
> regression coefficients not really applicable to a logistic model with
> random country and school effects?
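
One commonly used way to reconcile the two sets of df is the small-sample adjustment of Barnard and Rubin (1999), which caps the pooled df by the complete-data df. The sketch below is only an illustration under stated assumptions: the function name `barnard_rubin_df` is hypothetical, the classic df is taken to be (m - 1) / lambda^2 with lambda = (1 + 1/m) * B / T, and the glmmPQL DF of 31 for GDP is assumed to be a reasonable complete-data df.

```r
# Barnard-Rubin (1999) adjusted df for a pooled coefficient.
# df_old: Rubin's classic df; m: number of imputations;
# df_com: complete-data df (assumption: the DF reported by glmmPQL).
barnard_rubin_df <- function(df_old, m, df_com) {
  # Recover lambda, the share of total variance due to missingness,
  # from df_old = (m - 1) / lambda^2
  lambda <- sqrt((m - 1) / df_old)
  # Observed-data df, shrunk from the complete-data df
  df_obs <- (df_com + 1) / (df_com + 3) * df_com * (1 - lambda)
  # Harmonic-type combination; never exceeds df_obs or df_old
  1 / (1 / df_old + 1 / df_obs)
}

# GDP: classic df 7231.16666 from the table above, m = 10, df_com = 31
gdp_df <- barnard_rubin_df(7231.16666, m = 10, df_com = 31)
print(gdp_df)
```

By construction the adjusted value stays below the complete-data df, so for GDP it cannot exceed 31, however large the classic df is.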
>
> Thanks for any advice!!
> Ben Pelzer.