[R] how to improve this inefficient R code for imputing missing values
Coen van Hasselt
coenvanhasselt at gmail.com
Fri Nov 19 16:34:04 CET 2010
Hello all,
I have a big data.frame with multiple studies, subjects and timepoints per
subject, e.g.
STUDY[,1]  SUBJECT[,2]  ......  WT[,16]  HT[,17]  TEMP[,18]  BSA[,19]
1          1                    50       170      37         1.90
1          1                    NA       NA       NA         NA
1          1                    52       170      38         1.94
In this dataset, three types of missing (demographic) values exist:
1) the first value for a subject is missing:
i.e. study 1, subject 1: mis X1 X2 X3.
Here I want to carry the first non-missing value backwards to the missing value.
2) the last value for a subject is missing:
i.e. study 1, subject 1: X1 X2 X3 mis.
Here I want to carry the last non-missing value forwards to the missing value.
3) some "intermediate" value for a subject is missing (like the example
data.frame above):
i.e. study 1, subject 1: X1 mis X2 X3.
Here I want to impute the missing value with the mean of X1 and X2.
The missing values always affect the same subset of columns in the data
frame, i.e. the columns WT, HT, TEMP and BSA (m[,16:19]) are always missing
together.
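One possible direction, sketched here in base R and not benchmarked on the real data, is to write the three rules once for a single vector and then apply that function per study and subject with ave(). The helper name impute_vec, the toy data and the column names Ht, Temp and Bsa are assumptions made for illustration; rows are assumed to be in time order within each subject.

```r
# Impute one numeric vector (one subject's values for one column, in time order).
impute_vec <- function(x) {
  ok <- which(!is.na(x))                 # positions of observed values
  if (length(ok) == 0) return(x)         # nothing to impute from
  lo <- findInterval(seq_along(x), ok)   # count of observed values at or before each position
  hi <- lo + is.na(x)                    # for a missing value: the next observed one
  lo <- pmax(lo, 1L)                     # leading NAs: fall back to the first observed value
  hi <- pmin(hi, length(ok))             # trailing NAs: fall back to the last observed value
  (x[ok[lo]] + x[ok[hi]]) / 2            # mean of both neighbours; unchanged where observed
}

# Toy data; Ht, Temp and Bsa are assumed column names.
m <- data.frame(
  Study   = c(1, 1, 1, 1, 1, 1),
  Subject = c(1, 1, 1, 2, 2, 2),
  Wt      = c(50, NA, 52, NA, 70, NA),
  Ht      = c(170, NA, 170, NA, 180, NA),
  Temp    = c(37, NA, 38, NA, 36, NA),
  Bsa     = c(1.90, NA, 1.94, NA, 2.10, NA)
)

cols <- c("Wt", "Ht", "Temp", "Bsa")     # 16:19 in the real data
m[cols] <- lapply(m[cols], function(col) {
  ave(col, m$Study, m$Subject, FUN = impute_vec)
})
```

For a run of several consecutive missing rows, every row in the run gets the same mean of the two surrounding observed values, which matches the behaviour of the loop version. If linear interpolation is preferred instead, approx() with rule = 2 could be substituted inside the helper.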
I have written some R code that tries to do this, but it is incredibly
slow due to the many for-loops and the big dataset I have (and might
not even be completely correct yet).
QUESTION:
I would greatly appreciate it if somebody could give me some
guidance/hints on the direction I should roughly take to code the
above a little more efficiently than the horribly inefficient code
pasted below.
Thank you in advance and best regards,
Coen
for (s in unique(m$Study)) {  # for each study
  # for each subject with a missing value
  # (if $Wt is missing, all 4 columns 16:19 are missing)
  for (i in unique(m$Subject[m$Study == s & is.na(m$Wt)])) {
    # rows with NO missing values
    vals <- which(m$Study == s & m$Subject == i & !is.na(m$Wt))
    if (length(vals) == 0) next  # nothing to impute from for this subject
    # for each row that is missing for subject "i" and study "s"
    for (w in which(m$Study == s & m$Subject == i & is.na(m$Wt))) {
      if (w < min(vals)) {
        # FIRST VALUES MISSING: carry backwards
        m[w, 16:19] <- m[min(vals), 16:19]
      } else if (w > max(vals)) {
        # LAST VALUES MISSING: carry forwards
        m[w, 16:19] <- m[max(vals), 16:19]
      } else {
        # INTERMEDIATE VALUES MISSING: impute missing with mean
        nextV <- min(vals[vals > w])  # next row with observed values
        prevV <- max(vals[vals < w])  # previous row with observed values
        # colMeans() gives one mean per column
        m[w, 16:19] <- colMeans(m[c(prevV, nextV), 16:19], na.rm = TRUE)
      }
    }
  }
}