[R-sig-ME] Best way to handle missing data?

Malcolm Fairbrother M.Fairbrother at bristol.ac.uk
Wed Mar 4 08:42:23 CET 2015


Great, glad that worked!

On the topic of MCMC estimation with multiply imputed data, my
understanding is that you can simply merge each (m=5, 30, or whatever) of
the separate chains, and then analyse those using whatever techniques you
would have applied to a single chain. I guess this could also be useful for
assessing convergence. Others are welcome to correct me on this, however.

- Malcolm



On 4 March 2015 at 06:42, Bonnie Dixon <bmdixon at ucdavis.edu> wrote:

> Thanks, Malcolm!  Yes, that extracts all the results I need from out of
> the multiple imputation models.  This method also seems preferable over
> using Zelig because it provides the flexibility to create the models using
> lme4, or using a different package, such as nlme or MCMCglmm.  And, yes,
> the results using this method are identical to the results from Zelig.
>
> Bonnie
>
> P.S.  By the way, that error that occurs when using summary() on the
> object created by Zelig can be avoided by using the modified version of the
> summary.MI function below (from
> http://stackoverflow.com/questions/16571580/multi-level-regression-model-on-multiply-imputed-data-set-in-r-amelia-zelig-l
> ).
>
> summary.MI <- function (object, subset = NULL, ...) {
>   if (length(object) == 0) {
>     stop('Invalid input for "subset"')
>   } else {
>     if (length(object) == 1) {
>       return(summary(object[[1]]))
>     }
>   }
>
>     getcoef <- function(obj) {
>     # S4
>     if (!isS4(obj)) {
>       coef(obj)
>     } else {
>       if ("coef3" %in% slotNames(obj)) {
>         obj at coef3
>       } else {
>         obj at coef
>       }
>     }
>   }
>
>   res <- list()
>
>   # Get indices
>   subset <- if (is.null(subset)) {
>     1:length(object)
>   } else {
>     c(subset)
>   }
>
>   # Compute the summary of all objects
>   for (k in subset) {
>     res[[k]] <- summary(object[[k]])
>   }
>
>
>   # Answer
>   ans <- list(
>     zelig = object[[1]]$name,
>     call = object[[1]]$result at call,
>     all = res
>   )
>
>   coef1 <- se1 <- NULL
>
>   for (k in subset) {
>     tmp <- coef(res[[k]])
>     coef1 <- cbind(coef1, tmp[, 1])
>     se1 <- cbind(se1, tmp[, 2])
>   }
>
>   rows <- nrow(coef1)
>   Q <- apply(coef1, 1, mean)
>   U <- apply(se1^2, 1, mean)
>   B <- apply((coef1-Q)^2, 1, sum)/(length(subset)-1)
>   var <- U+(1+1/length(subset))*B
>   nu <- (length(subset)-1)*(1+U/((1+1/length(subset))*B))^2
>
>   coef.table <- matrix(NA, nrow = rows, ncol = 4)
>   dimnames(coef.table) <- list(rownames(coef1),
>                                c("Value", "Std. Error", "t-stat",
> "p-value"))
>   coef.table[,1] <- Q
>   coef.table[,2] <- sqrt(var)
>   coef.table[,3] <- Q/sqrt(var)
>   coef.table[,4] <- pt(abs(Q/sqrt(var)), df=nu, lower.tail=F)*2
>   ans$coefficients <- coef.table
>   ans$cov.scaled <- ans$cov.unscaled <- NULL
>
>   for (i in 1:length(ans)) {
>     if (is.numeric(ans[[i]]) && !names(ans)[i] %in% c("coefficients")) {
>       tmp <- NULL
>       for (j in subset) {
>         r <- res[[j]]
>         tmp <- cbind(tmp, r[[pmatch(names(ans)[i], names(res[[j]]))]])
>       }
>       ans[[i]] <- apply(tmp, 1, mean)
>     }
>   }
>
>   class(ans) <- "summaryMI"
>   ans
> }
>
> On Tue, Mar 3, 2015 at 1:53 AM, Malcolm Fairbrother <
> M.Fairbrother at bristol.ac.uk> wrote:
>
>> Hi Bonnie,
>>
>> I was getting an error with that code, and finding Zelig cumbersome. So
>> how about just:
>>
>> mods <- lapply(africa.am[[1]], function(x) lmer(gdp_pc ~ infl + (infl |
>> country), data=x))
>> rowMeans(sapply(mods, fixef)) # fixed effects
>> rowMeans(sapply(mods, function(x) as.data.frame(VarCorr(x))$vcov)) #
>> random effects variances
>> Reduce("+", sapply(mods, ranef))/length(mods) # random effects
>> (intercepts and slopes)
>> sqrt(rowMeans(sapply(mods, function(x) diag(vcov(x)))) +
>> diag(var(t(sapply(mods, fixef))))*(1+1/length(mods))) # SEs
>>
>> I think that gets you everything you want?
>>
>> The last row of code is my interpretation of: "The variance of the point
>> estimate is the average of the estimated variances from within each
>> completed data set, plus the sample variance in the point estimates across
>> the data sets (multiplied by a factor that corrects for the bias because m
>> < ∞)." (from http://r.iq.harvard.edu/docs/amelia/amelia.pdf)
>>
>> Does that correspond to what you were getting via Zelig? I'd be
>> interested to know that this worked, actually.
>>
>> Cheers,
>> Malcolm
>>
>>
>>
>>
>> On 2 March 2015 at 20:37, Bonnie Dixon <bmdixon at ucdavis.edu> wrote:
>>
>>> Thanks for this suggestion, Malcolm.  Here is an example in which I use
>>> Amelia/Zelig with the "africa" data set that is available in Amelia.
>>> I extracted the average standard deviation of the random effects from the
>>> result produced by Zelig.  (In this example, I am using the version of the
>>> summary.MI function found here:
>>> http://stackoverflow.com/questions/16571580/multi-level-regression-model-on-multiply-imputed-data-set-in-r-amelia-zelig-l)
>>>  Perhaps this approach will work for my purposes.
>>>
>>> # Get packages
>>> require(Amelia)
>>> require(Zelig)
>>> require(ZeligMultilevel)
>>>
>>> # Look at the data
>>> data(africa)
>>> head(africa)
>>> summary(africa)
>>> help(africa)
>>>
>>> # Impute the missing data
>>> africa.am <-
>>>   amelia(x = africa,
>>>          m = 30,
>>>          cs = "country",
>>>          ts = "year",
>>>          logs = "gdp_pc")
>>> summary(africa.am)
>>> plot(africa.am)
>>> missmap(africa.am)
>>> names(africa.am)
>>>
>>> # Create a model:
>>> africa.z <-
>>>   zelig(formula = gdp_pc ~ infl + tag(infl | country),
>>>         data = africa.am$imputations,
>>>         model = "ls.mixed")
>>>
>>> # The combined fixed effects:
>>> summary(africa.z)
>>>
>>> # The average standard deviation of the random intercepts and slopes:
>>> ran.ints <-
>>>   sapply(africa.z,
>>>          function(x)
>>>            attributes(VarCorr(x$result)$country)$stddev["(Intercept)"])
>>> mean(ran.ints)
>>>
>>> ran.slopes <-
>>>   sapply(africa.z,
>>>          function(x)
>>>            attributes(VarCorr(x$result)$country)$stddev["infl"])
>>> mean(ran.slopes)
>>>
>>>
>>>
>>> On Fri, Feb 27, 2015 at 4:47 AM, Malcolm Fairbrother <
>>> M.Fairbrother at bristol.ac.uk> wrote:
>>>
>>>> Hi Bonnie,
>>>>
>>>> I have not seen a formal treatment of this issue, but from the Amelia
>>>> documentation, my understanding is that if you want an estimate of the
>>>> random effects variance, you can just take the average of the estimates
>>>> from the model fitted to each imputed dataset. This is true for any
>>>> parameter, from the sounds of what Honaker, King, and Blackwell have
>>>> written.
>>>>
>>>>  "you can combine directly and use as the multiple imputation estimate
>>>> of this parameter, q ̄, the average of them separate estimates"
>>>>
>>>> Even if Zelig doesn't report the RE variance estimates automatically,
>>>> they must be "in there" somewhere... I'm sure you can extract them. Or
>>>> maybe skip Zelig, and just use Amelia, and extract the estimated RE
>>>> variances from each fitted model (presumably using lme4)?
>>>>
>>>> Cheers,
>>>> Malcolm
>>>>
>>>>
>>>> Date: Thu, 26 Feb 2015 21:20:33 -0800
>>>>> From: Bonnie Dixon <bmdixon at ucdavis.edu>
>>>>> To: Mitchell Maltenfort <mmalten at gmail.com>
>>>>> Cc: "r-sig-mixed-models at r-project.org"
>>>>>         <r-sig-mixed-models at r-project.org>
>>>>> Subject: Re: [R-sig-ME] Best way to handle missing data?
>>>>>
>>>>>
>>>>> I actually did try mice also (method "2l.norm"), but it seemed that
>>>>> Amelia
>>>>> was preferable for imputation.  Mice seems to only be able to impute
>>>>> one
>>>>> variable, whereas Amelia can impute as many variables as have missing
>>>>> data
>>>>> producing 100% complete data sets as output.
>>>>>
>>>>> However, most of the missing data in the data set I am working with is
>>>>> in
>>>>> just one variable, so I could consider using mice, and just imputing
>>>>> the
>>>>> variable that has the most missing data, while omitting observations
>>>>> that
>>>>> have missing data in any of the other variables.  But the pooled
>>>>> results
>>>>> from mice only seem to include the fixed effects of the model, so this
>>>>> still leaves me wondering how to report the random effects, which are
>>>>> very
>>>>> important to my research question.
>>>>>
>>>>> When using Amelia to impute, the packages Zelig and ZeligMultilevel
>>>>> can be
>>>>> used to combine the results from each of the models.  But again, only
>>>>> the
>>>>> fixed effects seem to be included in the output, so I am not sure how
>>>>> to
>>>>> report on the random effects.
>>>>>
>>>>> Bonnie
>>>>>
>>>>> On Thu, Feb 26, 2015 at 8:33 PM, Mitchell Maltenfort <
>>>>> mmalten at gmail.com>
>>>>> wrote:
>>>>>
>>>>> > Mice might be the package you need
>>>>> >
>>>>> >
>>>>> > On Thursday, February 26, 2015, Bonnie Dixon <bmdixon at ucdavis.edu>
>>>>> wrote:
>>>>> >
>>>>> >> Dear list;
>>>>> >>
>>>>> >> I am using nlme to create a repeated measures (i.e. 2 level)
>>>>> model.  There
>>>>> >> is missing data in several of the predictor variables.  What is the
>>>>> best
>>>>> >> way to handle this situation?  The variable with (by far) the most
>>>>> missing
>>>>> >> data is the best predictor in the model, so I would not want to
>>>>> remove it.
>>>>> >> I am also trying to avoid omitting the observations with missing
>>>>> data,
>>>>> >> because that would require omitting almost 40% of the observations
>>>>> and
>>>>> >> would result in a substantial loss of power.
>>>>> >>
>>>>> >> A member of my dissertation committee who uses SAS, recommended
>>>>> that I use
>>>>> >> full information maximum likelihood estimation (FIML) (described
>>>>> here:
>>>>> >>
>>>>> http://www.statisticalhorizons.com/wp-content/uploads/MissingDataByML.pdf
>>>>> >> ),
>>>>> >> which is the easiest way to handle missing data in SAS.  Is there an
>>>>> >> equivalent procedure in R?
>>>>> >>
>>>>> >> Alternatively, I have tried several approaches to multiple
>>>>> imputation.
>>>>> >> For
>>>>> >> example, I used the package, Amelia, which appears to handle the
>>>>> clustered
>>>>> >> structure of the data appropriately, to generate five imputed
>>>>> versions of
>>>>> >> the data set, and then used lapply to run my model on each.  But I
>>>>> am not
>>>>> >> sure how to combine the resulting five models into one final
>>>>> result.  I
>>>>> >> will need a final result that enables me to report, not just the
>>>>> fixed
>>>>> >> effects of the model, but also the random effects variance
>>>>> components and,
>>>>> >> ideally, the distributions across the population of the random
>>>>> intercept
>>>>> >> and slopes, and correlations between them.
>>>>> >>
>>>>> >> Many thanks for any suggestions on how to proceed.
>>>>> >>
>>>>> >> Bonnie
>>>>>
>>>>
>>>
>>
>

	[[alternative HTML version deleted]]



More information about the R-sig-mixed-models mailing list