[R] impute missing values in correlated variables: transcan?
roger koenker
rkoenker at uiuc.edu
Tue Nov 30 18:23:26 CET 2004
At the risk of stirring up a hornet's nest, I'd suggest that
means are dangerous in such applications. A nice paper
on combining ratings is: Gilbert Bassett and Joseph Persky,
Rating Skating, JASA, 1994, 1075-1079.
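
To make the concern concrete, here is a minimal sketch in base R that
compares the mean of the available ratings with a rank-based summary. It
reuses the simulated matrix from the quoted message below. The median of
within-rater ranks is only one illustrative alternative to the mean; it is
an assumption of this sketch and not necessarily the rule analyzed in the
paper. The names ratings, rater_ranks, mean_score and median_rank are
made up for the example.

```r
# Simulated ratings: 5 applicants (rows) by 4 raters (columns),
# built the same way as m1 in the quoted message.
set.seed(1)
ratings <- matrix(1:20 + rnorm(20), nrow = 5)
colnames(ratings) <- paste("R", 1:4, sep = "")
ratings[c(2, 19)] <- NA  # same two missing cells as in the quoted example

# Mean of whatever ratings are present. Raters use very different
# intercepts here, so a missing rating shifts that applicant's mean.
mean_score <- rowMeans(ratings, na.rm = TRUE)

# Convert each rater's scores to ranks within that rater (NAs stay NA),
# then take the median rank per applicant.
# Caveat: a rater with a missing score ranks only 4 applicants, so that
# column's ranks run 1..4 instead of 1..5.
rater_ranks <- apply(ratings, 2, rank, na.last = "keep")
median_rank <- apply(rater_ranks, 1, median, na.rm = TRUE)

print(data.frame(mean_score, median_rank))
```

Ranks discard each rater's slope and intercept, which is why they are
less sensitive to who happens to be absent, at the cost of ignoring how
far apart two applicants are on a given rater's scale.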
url:    www.econ.uiuc.edu/~roger        Roger Koenker
email   rkoenker at uiuc.edu            Department of Economics
vox:    217-333-4558                    University of Illinois
fax:    217-244-6678                    Champaign, IL 61820
On Nov 30, 2004, at 10:52 AM, Jonathan Baron wrote:
> I would like to impute missing data in a set of correlated
> variables (columns of a matrix). It looks like transcan() from
> Hmisc is roughly what I want. It says, "transcan automatically
> transforms continuous and categorical variables to have maximum
> correlation with the best linear combination of the other
> variables." And, "By default, transcan imputes NAs with "best
> guess" expected values of transformed variables, back transformed
> to the original scale."
>
> But I can't get it to work. I say
>
> m1 <- matrix(1:20+rnorm(20),5,) # four correlated variables
> colnames(m1) <- paste("R",1:4,sep="")
> m1[c(2,19)] <- NA # simulate some missing data
> library(Hmisc)
> transcan(m1,data=m1)
>
> and I get
>
> Error in rcspline.eval(y, nk = nk, inclx = TRUE) :
> fewer than 6 non-missing observations with knots omitted
>
> I've tried a few other things, but I think it is time to ask for
> help.
>
> The specific problem is a real one. Our graduate admissions
> committee (4 members) rates applications, and we average the
> ratings to get an overall rating for each applicant. Sometimes
> one of the committee members is absent, or late; hence the
> missing data. The members differ in the way they use the rating
> scale, in both slope and intercept (if you regress each on the
> mean). Many decisions end up depending on the second decimal
> place of the averages, so we want to do better than just averaging
> the non-missing ratings.
>
> Maybe I'm just not seeing something really simple. In fact, the
> problem is simpler than transcan assumes, since we are willing to
> assume linearity of the regression of each variable on the other
> variables. Other members proposed solutions that assumed this,
> but they did not take into account the fact that missing data at
> the high or low end of each variable (each member's ratings)
> would change its mean.
>
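
Under the linearity assumption described in the quoted paragraph, the
imputation can be sketched in base R without transcan. This is a minimal
illustration and not a recommendation: each rater's scores are regressed
on the mean of the other raters' available scores, and the fitted line
fills in that rater's missing cells on his or her own scale. The data are
simulated, and impute_linear, the intercepts, the slopes and the number of
applicants are all hypothetical.

```r
# Simulated data: 30 applicants, 4 raters who differ in slope and intercept.
set.seed(2)
n <- 30
quality    <- rnorm(n)                 # latent applicant quality (assumed)
intercepts <- c(3.0, 3.5, 2.5, 3.2)    # made-up rater intercepts
slopes     <- c(1.0, 0.6, 1.4, 0.9)    # made-up rater slopes
ratings <- sapply(1:4, function(j) {
  intercepts[j] + slopes[j] * quality + rnorm(n, sd = 0.2)
})
colnames(ratings) <- paste("R", 1:4, sep = "")
ratings[c(2, 45, 77, 110)] <- NA       # one missing rating per rater

# Fill each rater's NAs from a simple linear regression on the mean of
# the other raters' non-missing scores for the same applicant.
# Caveat: when another rater is also missing for an applicant, that mean
# is taken over fewer raters.
impute_linear <- function(m) {
  filled <- m
  for (j in seq_len(ncol(m))) {
    miss <- is.na(m[, j])
    if (!any(miss)) next
    others <- rowMeans(m[, -j, drop = FALSE], na.rm = TRUE)
    fit <- lm(y ~ x, data = data.frame(y = m[, j], x = others))  # NA rows dropped
    filled[miss, j] <- predict(fit, newdata = data.frame(x = others[miss]))
  }
  filled
}

completed <- impute_linear(ratings)
overall   <- rowMeans(completed)       # average over a full set of four ratings
```

Because every applicant then has four ratings, the overall average is no
longer pulled up or down by which rater was absent. The imputed values
are point predictions, so the averages look more precise than they are.
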
> Jon
> --
> Jonathan Baron, Professor of Psychology, University of Pennsylvania
> Home page: http://www.sas.upenn.edu/~baron
> R search page: http://finzi.psych.upenn.edu/
>