[R] impute missing values in correlated variables: transcan?

Tue Nov 30 18:23:26 CET 2004

At the risk of stirring up a hornet's nest, I'd suggest that
means are dangerous in such applications.  A nice paper
on combining ratings is:  Gilbert Bassett and Joseph Persky,
"Rating Skating", JASA, 1994, 1075-1079.
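
To make the concern about means concrete, here is a minimal sketch in
base R with made-up ratings (the matrix `ratings` and its values are
hypothetical, not data from this thread).  It compares the per-applicant
mean with the per-applicant median when one rater gives an extreme
score.  The median is only one possible robust summary, used here for
illustration; it is not presented as the method of the paper cited above.

```r
# Hypothetical ratings: 5 applicants (rows) by 4 raters (columns R1..R4).
# NA marks a rater who was absent for that applicant.
ratings <- matrix(c(3,  4,  3, 9,
                    4,  4,  5, 4,
                    2,  3, NA, 3,
                    5,  5,  4, 1,
                    3, NA,  3, 4),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(paste0("A", 1:5), paste0("R", 1:4)))

# Per-applicant summaries over the ratings that are present.
row_mean   <- rowMeans(ratings, na.rm = TRUE)
row_median <- apply(ratings, 1, median, na.rm = TRUE)

# Side-by-side comparison; A1 and A4 each have one extreme rating.
print(cbind(mean = row_mean, median = row_median))

# Orderings of the applicants under each summary (best first).
print(names(sort(row_mean, decreasing = TRUE)))
print(names(sort(row_median, decreasing = TRUE)))
```

In this toy example the single 9 from R4 puts A1 first on the mean,
while the median puts A4 first, so the two summaries can disagree on
exactly the kind of close decision described in the question.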

url:	www.econ.uiuc.edu/~roger        	Roger Koenker
email	rkoenker at uiuc.edu			Department of Economics
vox: 	217-333-4558				University of Illinois
fax:   	217-244-6678				Champaign, IL 61820

On Nov 30, 2004, at 10:52 AM, Jonathan Baron wrote:

> I would like to impute missing data in a set of correlated
> variables (columns of a matrix).  It looks like transcan() from
> Hmisc is roughly what I want.  It says, "transcan automatically
> transforms continuous and categorical variables to have maximum
> correlation with the best linear combination of the other
> variables." And, "By default, transcan imputes NAs with 'best
> guess' expected values of transformed variables, back transformed
> to the original scale."
>
> But I can't get it to work.  I say
>
> m1 <- matrix(1:20+rnorm(20),5,)  # four correlated variables
> colnames(m1) <- paste("R",1:4,sep="")
> m1[c(2,19)] <- NA                # simulate some missing data
> library(Hmisc)
> transcan(m1,data=m1)
>
> and I get
>
> Error in rcspline.eval(y, nk = nk, inclx = TRUE) :
>       fewer than 6 non-missing observations with knots omitted
>
> I've tried a few other things, but I think it is time to ask for
> help.
>
> The specific problem is a real one.  Our graduate admissions
> committee (4 members) rates applications, and we average the
> ratings to get an overall rating for each applicant.  Sometimes
> one of the committee members is absent, or late; hence the
> missing data.  The members differ in the way they use the rating
> scale, in both slope and intercept (if you regress each on the
> mean).  Many decisions end up depending on the second decimal
> place of the averages, so we want to do better than just averaging
> the non-missing ratings.
>
> Maybe I'm just not seeing something really simple.  In fact, the
> problem is simpler than transcan assumes, since we are willing to
> assume linearity of the regression of each variable on the other
> variables.  Other members proposed solutions that assumed this,
> but they did not take into account the fact that missing data at
> the high or low end of each variable (each member's ratings)
> would change its mean.
>
> Jon
> -- 
> Jonathan Baron, Professor of Psychology, University of Pennsylvania
> Home page: http://www.sas.upenn.edu/~baron
> R search page: http://finzi.psych.upenn.edu/
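
The linearity assumption mentioned in the question can be sketched in
base R without transcan().  The example below is an illustration under
stated assumptions, not a recommendation from either poster: the data
are simulated, the helper `impute_linear()` is a hypothetical name, and
each missing rating is predicted from an ordinary lm() fit of that rater
on the raters who did rate the applicant.  It uses 30 simulated
applicants rather than 5, since a regression on three predictors needs
more rows than the toy matrix in the question provides.

```r
set.seed(1)
n <- 30                                    # hypothetical number of applicants
quality <- rnorm(n)                        # simulated latent applicant quality

# Each simulated rater has a different intercept and slope, plus noise.
m <- cbind(R1 = 3.0 + 1.0 * quality,
           R2 = 2.5 + 0.6 * quality,
           R3 = 3.5 + 1.4 * quality,
           R4 = 3.0 + 0.9 * quality) +
     matrix(rnorm(n * 4, sd = 0.3), n, 4)
m[c(2, 19, 47, 100)] <- NA                 # simulate a few missing ratings

# Hypothetical helper: fill each NA by linear regression on the raters
# who are present for that applicant.
impute_linear <- function(x) {
  filled <- x
  d <- as.data.frame(x)
  for (i in which(!complete.cases(x))) {
    miss <- is.na(x[i, ])
    if (all(miss)) next                    # no ratings to predict from
    for (j in which(miss)) {
      # Regress rater j on the observed raters; lm() drops rows with NAs.
      f <- reformulate(colnames(x)[!miss], response = colnames(x)[j])
      fit <- lm(f, data = d)
      filled[i, j] <- predict(fit, newdata = d[i, , drop = FALSE])
    }
  }
  filled
}

m_filled <- impute_linear(m)
overall  <- rowMeans(m_filled)             # average after imputation
print(round(m_filled[c(2, 10, 17, 19), ], 2))
```

Because each prediction comes from a regression on the other raters, a
rating missing at the high or low end of a rater's range is filled in at
a correspondingly high or low value, rather than being replaced by that
rater's mean.  A single regression fill like this ignores the
uncertainty in the imputed values, and the final step is still a mean,
so the caution at the top of this message applies to it as well.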