[R-sig-ME] Best way to handle missing data?
Malcolm Fairbrother
M.Fairbrother at bristol.ac.uk
Wed Mar 4 08:42:23 CET 2015
Great, glad that worked!
On the topic of MCMC estimation with multiply imputed data, my
understanding is that you can simply merge each (m=5, 30, or whatever) of
the separate chains, and then analyse those using whatever techniques you
would have applied to a single chain. I guess this could also be useful for
assessing convergence. Others are welcome to correct me on this, however.
- Malcolm
On 4 March 2015 at 06:42, Bonnie Dixon <bmdixon at ucdavis.edu> wrote:
> Thanks, Malcolm! Yes, that extracts all the results I need from out of
> the multiple imputation models. This method also seems preferable over
> using Zelig because it provides the flexibility to create the models using
> lme4, or using a different package, such as nlme or MCMCglmm. And, yes,
> the results using this method are identical to the results from Zelig.
>
> Bonnie
>
> P.S. By the way, that error that occurs when using summary() on the
> object created by Zelig can be avoided by using the modified version of the
> summary.MI function below (from
> http://stackoverflow.com/questions/16571580/multi-level-regression-model-on-multiply-imputed-data-set-in-r-amelia-zelig-l
> ).
>
> summary.MI <- function (object, subset = NULL, ...) {
> if (length(object) == 0) {
> stop('Invalid input for "subset"')
> } else {
> if (length(object) == 1) {
> return(summary(object[[1]]))
> }
> }
>
> getcoef <- function(obj) {
> # S4
> if (!isS4(obj)) {
> coef(obj)
> } else {
> if ("coef3" %in% slotNames(obj)) {
> obj at coef3
> } else {
> obj at coef
> }
> }
> }
>
> res <- list()
>
> # Get indices
> subset <- if (is.null(subset)) {
> 1:length(object)
> } else {
> c(subset)
> }
>
> # Compute the summary of all objects
> for (k in subset) {
> res[[k]] <- summary(object[[k]])
> }
>
>
> # Answer
> ans <- list(
> zelig = object[[1]]$name,
> call = object[[1]]$result at call,
> all = res
> )
>
> coef1 <- se1 <- NULL
>
> for (k in subset) {
> tmp <- coef(res[[k]])
> coef1 <- cbind(coef1, tmp[, 1])
> se1 <- cbind(se1, tmp[, 2])
> }
>
> rows <- nrow(coef1)
> Q <- apply(coef1, 1, mean)
> U <- apply(se1^2, 1, mean)
> B <- apply((coef1-Q)^2, 1, sum)/(length(subset)-1)
> var <- U+(1+1/length(subset))*B
> nu <- (length(subset)-1)*(1+U/((1+1/length(subset))*B))^2
>
> coef.table <- matrix(NA, nrow = rows, ncol = 4)
> dimnames(coef.table) <- list(rownames(coef1),
> c("Value", "Std. Error", "t-stat",
> "p-value"))
> coef.table[,1] <- Q
> coef.table[,2] <- sqrt(var)
> coef.table[,3] <- Q/sqrt(var)
> coef.table[,4] <- pt(abs(Q/sqrt(var)), df=nu, lower.tail=F)*2
> ans$coefficients <- coef.table
> ans$cov.scaled <- ans$cov.unscaled <- NULL
>
> for (i in 1:length(ans)) {
> if (is.numeric(ans[[i]]) && !names(ans)[i] %in% c("coefficients")) {
> tmp <- NULL
> for (j in subset) {
> r <- res[[j]]
> tmp <- cbind(tmp, r[[pmatch(names(ans)[i], names(res[[j]]))]])
> }
> ans[[i]] <- apply(tmp, 1, mean)
> }
> }
>
> class(ans) <- "summaryMI"
> ans
> }
>
> On Tue, Mar 3, 2015 at 1:53 AM, Malcolm Fairbrother <
> M.Fairbrother at bristol.ac.uk> wrote:
>
>> Hi Bonnie,
>>
>> I was getting an error with that code, and finding Zelig cumbersome. So
>> how about just:
>>
>> mods <- lapply(africa.am[[1]], function(x) lmer(gdp_pc ~ infl + (infl |
>> country), data=x))
>> rowMeans(sapply(mods, fixef)) # fixed effects
>> rowMeans(sapply(mods, function(x) as.data.frame(VarCorr(x))$vcov)) #
>> random effects variances
>> Reduce("+", sapply(mods, ranef))/length(mods) # random effects
>> (intercepts and slopes)
>> sqrt(rowMeans(sapply(mods, function(x) diag(vcov(x)))) +
>> diag(var(t(sapply(mods, fixef))))*(1+1/length(mods))) # SEs
>>
>> I think that gets you everything you want?
>>
>> The last row of code is my interpretation of: "The variance of the point
>> estimate is the average of the estimated variances from within each
>> completed data set, plus the sample variance in the point estimates across
>> the data sets (multiplied by a factor that corrects for the bias because m
>> < ∞)." (from http://r.iq.harvard.edu/docs/amelia/amelia.pdf)
>>
>> Does that correspond to what you were getting via Zelig? I'd be
>> interested to know that this worked, actually.
>>
>> Cheers,
>> Malcolm
>>
>>
>>
>>
>> On 2 March 2015 at 20:37, Bonnie Dixon <bmdixon at ucdavis.edu> wrote:
>>
>>> Thanks for this suggestion, Malcolm. Here is an example in which I use
>>> Amelia/Zelig with the "africa" data set that is available in Amelia.
>>> I extracted the average standard deviation of the random effects from the
>>> result produced by Zelig. (In this example, I am using the version of the
>>> summary.MI function found here:
>>> http://stackoverflow.com/questions/16571580/multi-level-regression-model-on-multiply-imputed-data-set-in-r-amelia-zelig-l)
>>> Perhaps this approach will work for my purposes.
>>>
>>> # Get packages
>>> require(Amelia)
>>> require(Zelig)
>>> require(ZeligMultilevel)
>>>
>>> # Look at the data
>>> data(africa)
>>> head(africa)
>>> summary(africa)
>>> help(africa)
>>>
>>> # Impute the missing data
>>> africa.am <-
>>> amelia(x = africa,
>>> m = 30,
>>> cs = "country",
>>> ts = "year",
>>> logs = "gdp_pc")
>>> summary(africa.am)
>>> plot(africa.am)
>>> missmap(africa.am)
>>> names(africa.am)
>>>
>>> # Create a model:
>>> africa.z <-
>>> zelig(formula = gdp_pc ~ infl + tag(infl | country),
>>> data = africa.am$imputations,
>>> model = "ls.mixed")
>>>
>>> # The combined fixed effects:
>>> summary(africa.z)
>>>
>>> # The average standard deviation of the random intercepts and slopes:
>>> ran.ints <-
>>> sapply(africa.z,
>>> function(x)
>>> attributes(VarCorr(x$result)$country)$stddev["(Intercept)"])
>>> mean(ran.ints)
>>>
>>> ran.slopes <-
>>> sapply(africa.z,
>>> function(x)
>>> attributes(VarCorr(x$result)$country)$stddev["infl"])
>>> mean(ran.slopes)
>>>
>>>
>>>
>>> On Fri, Feb 27, 2015 at 4:47 AM, Malcolm Fairbrother <
>>> M.Fairbrother at bristol.ac.uk> wrote:
>>>
>>>> Hi Bonnie,
>>>>
>>>> I have not seen a formal treatment of this issue, but from the Amelia
>>>> documentation, my understanding is that if you want an estimate of the
>>>> random effects variance, you can just take the average of the estimates
>>>> from the model fitted to each imputed dataset. This is true for any
>>>> parameter, from the sounds of what Honaker, King, and Blackwell have
>>>> written.
>>>>
>>>> "you can combine directly and use as the multiple imputation estimate
>>>> of this parameter, q ̄, the average of them separate estimates"
>>>>
>>>> Even if Zelig doesn't report the RE variance estimates automatically,
>>>> they must be "in there" somewhere... I'm sure you can extract them. Or
>>>> maybe skip Zelig, and just use Amelia, and extract the estimated RE
>>>> variances from each fitted model (presumably using lme4)?
>>>>
>>>> Cheers,
>>>> Malcolm
>>>>
>>>>
>>>> Date: Thu, 26 Feb 2015 21:20:33 -0800
>>>>> From: Bonnie Dixon <bmdixon at ucdavis.edu>
>>>>> To: Mitchell Maltenfort <mmalten at gmail.com>
>>>>> Cc: "r-sig-mixed-models at r-project.org"
>>>>> <r-sig-mixed-models at r-project.org>
>>>>> Subject: Re: [R-sig-ME] Best way to handle missing data?
>>>>>
>>>>>
>>>>> I actually did try mice also (method "2l.norm"), but it seemed that
>>>>> Amelia
>>>>> was preferable for imputation. Mice seems to only be able to impute
>>>>> one
>>>>> variable, whereas Amelia can impute as many variables as have missing
>>>>> data
>>>>> producing 100% complete data sets as output.
>>>>>
>>>>> However, most of the missing data in the data set I am working with is
>>>>> in
>>>>> just one variable, so I could consider using mice, and just imputing
>>>>> the
>>>>> variable that has the most missing data, while omitting observations
>>>>> that
>>>>> have missing data in any of the other variables. But the pooled
>>>>> results
>>>>> from mice only seem to include the fixed effects of the model, so this
>>>>> still leaves me wondering how to report the random effects, which are
>>>>> very
>>>>> important to my research question.
>>>>>
>>>>> When using Amelia to impute, the packages Zelig and ZeligMultilevel
>>>>> can be
>>>>> used to combine the results from each of the models. But again, only
>>>>> the
>>>>> fixed effects seem to be included in the output, so I am not sure how
>>>>> to
>>>>> report on the random effects.
>>>>>
>>>>> Bonnie
>>>>>
>>>>> On Thu, Feb 26, 2015 at 8:33 PM, Mitchell Maltenfort <
>>>>> mmalten at gmail.com>
>>>>> wrote:
>>>>>
>>>>> > Mice might be the package you need
>>>>> >
>>>>> >
>>>>> > On Thursday, February 26, 2015, Bonnie Dixon <bmdixon at ucdavis.edu>
>>>>> wrote:
>>>>> >
>>>>> >> Dear list;
>>>>> >>
>>>>> >> I am using nlme to create a repeated measures (i.e. 2 level)
>>>>> model. There
>>>>> >> is missing data in several of the predictor variables. What is the
>>>>> best
>>>>> >> way to handle this situation? The variable with (by far) the most
>>>>> missing
>>>>> >> data is the best predictor in the model, so I would not want to
>>>>> remove it.
>>>>> >> I am also trying to avoid omitting the observations with missing
>>>>> data,
>>>>> >> because that would require omitting almost 40% of the observations
>>>>> and
>>>>> >> would result in a substantial loss of power.
>>>>> >>
>>>>> >> A member of my dissertation committee who uses SAS, recommended
>>>>> that I use
>>>>> >> full information maximum likelihood estimation (FIML) (described
>>>>> here:
>>>>> >>
>>>>> http://www.statisticalhorizons.com/wp-content/uploads/MissingDataByML.pdf
>>>>> >> ),
>>>>> >> which is the easiest way to handle missing data in SAS. Is there an
>>>>> >> equivalent procedure in R?
>>>>> >>
>>>>> >> Alternatively, I have tried several approaches to multiple
>>>>> imputation.
>>>>> >> For
>>>>> >> example, I used the package, Amelia, which appears to handle the
>>>>> clustered
>>>>> >> structure of the data appropriately, to generate five imputed
>>>>> versions of
>>>>> >> the data set, and then used lapply to run my model on each. But I
>>>>> am not
>>>>> >> sure how to combine the resulting five models into one final
>>>>> result. I
>>>>> >> will need a final result that enables me to report, not just the
>>>>> fixed
>>>>> >> effects of the model, but also the random effects variance
>>>>> components and,
>>>>> >> ideally, the distributions across the population of the random
>>>>> intercept
>>>>> >> and slopes, and correlations between them.
>>>>> >>
>>>>> >> Many thanks for any suggestions on how to proceed.
>>>>> >>
>>>>> >> Bonnie
>>>>>
>>>>
>>>
>>
>
[[alternative HTML version deleted]]
More information about the R-sig-mixed-models
mailing list