[R] Imputing missing values
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Wed Sep 1 14:10:54 CEST 2004
Dimitris Rizopoulos wrote:
> Hi Jan,
>
> you could try the following:
>
> dat <- data.frame(Price=c(10,12,NA,8,7,9,NA,9,NA),
> Crop=c(rep("Rice", 5), rep("Wheat", 4)),
> Season=c(rep("Summer", 3), rep("Winter", 4),
> rep("Summer", 2)))
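> # Sort the rows so that they follow the order in which
> # unlist(tapply(...)) returns the groups (Crop varies fastest
> # within Season); otherwise the imputed values would be misaligned.
> dat <- dat[order(dat$Season, dat$Crop),]
> dat$Price.imp <- unlist(tapply(dat$Price, list(dat$Crop, dat$Season),
>     function(x){
>         # group mean of the non-missing prices
>         mx <- mean(x, na.rm=TRUE)
>         # keep observed values, substitute the mean for NAs
>         ifelse(is.na(x), mx, x)
>     }))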
>
> dat
>
> However, you should be careful using this imputation technique, since
> it does not take into account the extra variability introduced by
> imputing new values in your data set. I don't know what analysis you
> are planning to do, but in any case I would recommend reading some
> standard references on missing values, e.g., Little, R. and Rubin, D.
> (2002). Statistical Analysis with Missing Data, New York: Wiley.
>
> I hope this helps.
>
> Best,
> Dimitris
>
> ----
> Dimitris Rizopoulos
> Doctoral Student
> Biostatistical Centre
> School of Public Health
> Catholic University of Leuven
>
> Address: Kapucijnenvoer 35, Leuven, Belgium
> Tel: +32/16/396887
> Fax: +32/16/337015
> Web: http://www.med.kuleuven.ac.be/biostat/
> http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
>
>
> ----- Original Message -----
> From: "Jan Smit" <janpsmit at yahoo.co.uk>
> To: <R-help at stat.math.ethz.ch>
> Sent: Wednesday, September 01, 2004 10:43 AM
> Subject: [R] Imputing missing values
>
>
>
>>Dear all,
>>
>>Apologies for this beginner's question. I have a
>>variable Price, which is associated with factors
>>Season and Crop, each of which has several levels.
>>The Price variable contains missing values (NA), which
>>I want to substitute by the mean of the remaining
>>(non-NA) Price values of the same Season-Crop
>>combination of levels.
>>
>>Price Crop Season
>>10 Rice Summer
>>12 Rice Summer
>>NA Rice Summer
>>8 Rice Winter
>>9 Wheat Summer
>>
>>Price[is.na(Price)] gives me the missing values, and
>>by(Price, list(Crop, Season), mean, na.rm = TRUE) the
>>values I want to impute. What I've not been able to
>>figure out, by looking at by and the various
>>incarnations of apply, is how to do the actual
>>substitution.
>>
>>Any help would be much appreciated.
>>
>>Jan Smit
Or see the impute function in the Hmisc package, along with the more
general imputation solutions that Hmisc also provides.
--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University
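
For readers following this thread, the group-wise substitution asked about can also be sketched in base R with ave(), which returns its result in the original row order, so no sorting is needed. This is only a minimal illustration under stated assumptions: the data frame mirrors the example quoted above (with the crop spelled "Rice"), and the helper name fill_with_mean is made up for this sketch. It is not the method proposed by either poster, and the caution about ignoring imputation variability applies to it equally.

```r
# Example data mirroring the thread (9 rows, 3 missing prices)
dat <- data.frame(
  Price  = c(10, 12, NA, 8, 7, 9, NA, 9, NA),
  Crop   = c(rep("Rice", 5), rep("Wheat", 4)),
  Season = c(rep("Summer", 3), rep("Winter", 4), rep("Summer", 2))
)

# Hypothetical helper: replace the NAs in x by the mean of its non-missing values.
# If every value in x is NA, the mean is NaN, so nothing useful is imputed.
fill_with_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# ave() applies the helper within each Crop-Season cell and puts the
# results back in the original row order
dat$Price.imp <- ave(dat$Price, dat$Crop, dat$Season, FUN = fill_with_mean)

dat
```

Here row 3 (Rice, Summer) receives 11, the mean of 10 and 12, while rows 7 and 9 each receive 9, the only observed Wheat price in their respective seasons.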