[R] multiple imputation with fit.mult.impute in Hmisc - how to replace NA with imputed value?
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Wed Nov 26 17:46:11 CET 2008
Charlie Brush wrote:
> Frank E Harrell Jr wrote:
>> Charlie Brush wrote:
>>> I am doing multiple imputation with Hmisc, and
>>> can't figure out how to replace the NA values with
>>> the imputed values.
>>>
>>> Here's a general outline of the process:
>>>
>>> > set.seed(23)
>>> > library("mice")
>>> > library("Hmisc")
>>> > library("Design")
>>> > d <- read.table("DailyDataRaw_01.txt",header=T)
>>> > length(d);length(d[,1])
>>> [1] 43
>>> [1] 2666
>>> So for this data set, there are 43 columns and 2666 rows.
>>>
>>> Here is a piece of data.frame d:
>>> > d[1:20,4:6]
>>> P01 P02 P03
>>> 1 0.1 0.16 0.16
>>> 2 NA 0.00 0.00
>>> 3 NA 0.60 0.04
>>> 4 NA 0.15 0.00
>>> 5 NA 0.00 0.00
>>> 6 0.7 0.00 0.75
>>> 7 NA 0.00 0.00
>>> 8 NA 0.00 0.00
>>> 9 0.0 0.00 0.00
>>> 10 0.0 0.00 0.00
>>> 11 0.0 0.00 0.00
>>> 12 0.0 0.00 0.00
>>> 13 0.0 0.00 0.00
>>> 14 0.0 0.00 0.00
>>> 15 0.0 0.00 0.03
>>> 16 NA 0.00 0.00
>>> 17 NA 0.01 0.00
>>> 18 0.0 0.00 0.00
>>> 19 0.0 0.00 0.00
>>> 20 0.0 0.00 0.00
>>>
>>> These are daily precipitation values at NCDC stations, and
>>> NA values at station P01 will be filled using multiple
>>> imputation and data from highly correlated stations P02 and P08:
>>>
>>> > f <- aregImpute(~ I(P01) + I(P02) + I(P08),
>>> n.impute=10,match='closest',data=d)
>>> Iteration 13
>>> > fmi <- fit.mult.impute( P01 ~ P02 + P08 , ols, f, d)
>>>
>>> Variance Inflation Factors Due to Imputation:
>>>
>>> Intercept P02 P08
>>> 1.01 1.39 1.16
>>>
>>> Rate of Missing Information:
>>>
>>> Intercept P02 P08
>>> 0.01 0.28 0.14
>>>
>>> d.f. for t-distribution for Tests of Single Coefficients:
>>>
>>> Intercept P02 P08
>>> 242291.18 116.05 454.95
>>> > r <- apply(f$imputed$P01,1,mean)
>>> > r
>>> 2 3 4 5 7 8 16 17 249 250 251
>>> 0.002 0.430 0.044 0.002 0.002 0.002 0.002 0.123 0.002 0.002 0.002
>>> 252 253 254 255 256 257 258 259 260 261 262
>>> 1.033 0.529 1.264 0.611 0.002 0.513 0.085 0.002 0.705 0.840 0.719
>>> 263 264 265 266 267 268 269 270 271 272 273
>>> 1.489 0.532 0.150 0.134 0.002 0.002 0.002 0.002 0.002 0.055 0.135
>>> 274 275 276 277 278 279 280 281 282 283 284
>>> 0.009 0.002 0.002 0.002 0.008 0.454 1.676 1.462 0.071 0.002 1.029
>>> 285 286 287 288 289 418 419 420 421 422 700
>>> 0.055 0.384 0.947 0.002 0.002 0.008 0.759 0.066 0.009 0.002 0.002
>>>
>>> ------------------------------------------------------------------
>>> So far, this is working great.
>>> Now, make a copy of d:
>>> > dnew <- d
>>>
>>> And then fill in the NA values in P01 with the values in r
>>>
>>> For example:
>>> > for (i in 1:length(r)){
>>> dnew$P01[r[i,1]] <- r[i,2]
>>> }
>>> This doesn't work, because each 'piece' of r is two numbers:
>>> > r[1]
>>> 2
>>> 0.002
>>> > r[1,1]
>>> Error in r[1, 1] : incorrect number of dimensions
>>>
>>> My question: how can I separate the two items in (for example)
>>> r[1] to use the first part as an index and the second as a value,
>>> and then use them to replace the NA values with the imputed values?
>>>
>>> Or is there a better way to replace the NA values with the imputed
>>> values?
>>>
>>> Thanks in advance for any help.
>>>
>>
>> You didn't state your goal, and why fit.mult.impute does not do what
>> you want. But you can look inside fit.mult.impute to see how it
>> retrieves the imputed values. Also see the example in the documentation
>> for transcan, in which the command impute(xt, imputation=1) is used to
>> retrieve one of the multiple imputations.
>>
>> Note that you can say library(Design) (omit the quotes) to access both
>> Design and Hmisc.
>>
>> Frank
> Thanks for your help.
> My goal is to replace the NA values in the (copy of the) data frame with
> the means of the imputed values (which are now in variable 'r').
> fit.mult.impute works fine. I just can't figure out the last step,
> taking the results of fit.mult.impute (which are in variable 'r') and
> replacing the NA values in the (copy of the) data frame.
> A simple for loop doesn't work because the items in 'r' don't look like
> a normal vector, as for example r[1] returns
> 2
> 0.002
> Is there a command to replace the NA values in the data frame with the
> means of the imputed values?
>
> Thanks,
> Charlie
>
Don't do that, as this would no longer be multiple imputation. If you
want single conditional mean imputation use transcan.
Frank
--
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University
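
On the mechanics asked about in the thread: apply(..., 1, mean) returns a
named numeric vector, so r[1] prints a name above a value rather than
holding two numbers. The following is a minimal base R sketch, assuming a
toy data frame and a made-up matrix `imp` standing in for f$imputed$P01
(neither comes from the original post). The mean fill is shown only to
illustrate the indexing; as the reply points out, filling with means is
single imputation, so the sketch also keeps one completed copy per
imputation.

```r
# Toy stand-in for the poster's data frame d (values are made up).
d <- data.frame(P01 = c(0.1, NA, NA, 0.7, NA, 0.0),
                P02 = c(0.16, 0.00, 0.60, 0.00, 0.15, 0.00))

# Hypothetical stand-in for f$imputed$P01: one row per missing observation,
# one column per imputation, row names giving the row labels in d.
imp <- matrix(c(0.0, 0.0, 0.1,
                0.4, 0.5, 0.3,
                0.0, 0.1, 0.2),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("2", "3", "5"), NULL))

# Row means come back as a named numeric vector (one dimension only),
# which is why r[1, 1] fails with "incorrect number of dimensions".
r <- apply(imp, 1, mean)
idx  <- names(r)    # row labels, as character strings
vals <- unname(r)   # the averaged values

# Mechanics of the fill: look up each label among the row names of the copy.
dnew <- d
dnew$P01[match(idx, rownames(dnew))] <- vals

# Keeping the imputations separate instead: one completed copy of d per
# column of imp, so between-imputation variability is not averaged away.
completed <- lapply(seq_len(ncol(imp)), function(j) {
  dj <- d
  dj$P01[match(rownames(imp), rownames(dj))] <- imp[, j]
  dj
})
```

Here names(r) supplies the index and unname(r) the value, which is the
separation the question asks for. The list `completed` is only an
illustration of keeping the draws apart; fit.mult.impute does its own
retrieval of imputed values internally.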