[R] impute missing values in correlated variables: transcan?

Jonathan Baron baron at psych.upenn.edu
Tue Nov 30 20:50:26 CET 2004


On 11/30/04 13:21, Frank E Harrell Jr wrote:
>Jonathan Baron wrote:
>> I would like to impute missing data in a set of correlated
>> variables (columns of a matrix).  It looks like transcan() from
>> Hmisc is roughly what I want.  It says, "transcan automatically
>> transforms continuous and categorical variables to have maximum
>> correlation with the best linear combination of the other
>> variables." And, "By default, transcan imputes NAs with "best
>> guess" expected values of transformed variables, back transformed
>> to the original scale."
>>
>> But I can't get it to work.  I say
>>
>> m1 <- matrix(1:20+rnorm(20),5,)  # four correlated variables
>> colnames(m1) <- paste("R",1:4,sep="")
>> m1[c(2,19)] <- NA                # simulate some missing data
>> library(Hmisc)
>> transcan(m1,data=m1)
>>
>> and I get
>>
>> Error in rcspline.eval(y, nk = nk, inclx = TRUE) :
>>       fewer than 6 non-missing observations with knots omitted
>
>Jonathan - you would need many more observations to be able to fit
>flexible additive models as transcan does.  Also note that single
>imputation has problems and you may want to consider multiple imputation
>as done by the Hmisc aregImpute function, if you had more data.

Thanks.  But the fits don't _need_ to be as flexible as the ones
transcan uses.  Linear would be OK, but I can't find an option for
that in transcan.
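
For illustration, here is a minimal sketch of plain linear-regression
imputation in base R.  It is not transcan and not an Hmisc feature: the
helper name impute_linear is hypothetical, and the approach (fit lm() on
the complete rows, then predict the missing entries from the other
columns) is just one simple way to get linear "best guess" values.

```r
set.seed(1)                                # only to make the sketch reproducible
m1 <- matrix(1:80 + rnorm(80), ncol = 4)   # 20 rows, four correlated variables
colnames(m1) <- paste("R", 1:4, sep = "")
m1[c(2, 19)] <- NA                         # with 20 rows, both NAs fall in R1

# Hypothetical helper: fill each column's NAs from a linear fit on the others.
impute_linear <- function(m) {
  d <- as.data.frame(m)
  out <- d
  for (v in names(d)) {
    miss <- is.na(d[[v]])
    if (!any(miss)) next
    # lm() drops incomplete rows by default, so only complete rows are used
    fit <- lm(reformulate(setdiff(names(d), v), response = v), data = d)
    out[miss, v] <- predict(fit, newdata = d[miss, , drop = FALSE])
  }
  as.matrix(out)
}

m1.filled <- impute_linear(m1)
```

A row that is missing a predictor as well as the target would still
come back NA from predict(), so this only covers cases like the
simulation here, where each incomplete row lacks a single value.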

We _will_ have more data, about 50 applicants rated by the time
we start making decisions.  So I tried my little simulation with
more data, and it didn't give an error message.  So that was the
problem.  Here is the new one:

m1 <- matrix(1:80+rnorm(80),,4)
colnames(m1) <- paste("R",1:4,sep="")
m1[c(2,19)] <- NA
library(Hmisc)
t1 <- transcan(m1,data=m1,long=TRUE,imputed=TRUE)

I've used aregImpute, and I notice it has a "defaultlinear"
option, which is good.  Thus, it may work better once I figure
out how to get a single value out of it for each missing datum
(which doesn't look too hard).
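
One way this might be done, as a sketch only: average the multiple
imputations that aregImpute stores for each missing entry.  The sketch
assumes Hmisc is installed, that nk = 0 forces linear fits for
continuous variables, and that the fitted object keeps its imputations
in the $imputed component (one row per missing observation, one column
per imputation), as I read the Hmisc documentation.  It does not use
the "defaultlinear" option mentioned above.  The 50 simulated rows
stand in for the roughly 50 applicants.

```r
library(Hmisc)   # assumed to be installed

set.seed(1)                                  # only for reproducibility
m1 <- matrix(1:200 + rnorm(200), ncol = 4)   # 50 rows, four correlated variables
colnames(m1) <- paste("R", 1:4, sep = "")
m1[c(2, 19)] <- NA                           # with 50 rows, both NAs fall in R1
d <- as.data.frame(m1)

# nk = 0 requests linear fits; pr = FALSE silences the iteration printout
a <- aregImpute(~ R1 + R2 + R3 + R4, data = d,
                n.impute = 5, nk = 0, pr = FALSE)

# Average the 5 imputations to get a single value per missing datum
single <- rowMeans(a$imputed$R1)

d.filled <- d
d.filled$R1[is.na(d$R1)] <- single
```

Averaging discards the between-imputation variability, which is
acceptable only because the goal here is a single filled-in value
rather than inference.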

This is not about statistical inference, which seems to me to be
where the main advantage of multiple imputation lies.  But
probably it won't do any harm.

Jon
-- 
Jonathan Baron, Professor of Psychology, University of Pennsylvania
Home page: http://www.sas.upenn.edu/~baron
R search page: http://finzi.psych.upenn.edu/
