[R] handling of missing values in aov/lm
R.J.V. Bertin
rjvbertin at vip.webmails.com
Fri Jun 28 17:02:27 CEST 2002
R provides a few ways of handling missing values, among others in the context
of an anova (aov): two types of exclusion (na.omit and na.exclude) and failure
(na.fail).
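For reference, these built-in behaviours are selected through the na.action argument (na.omit, na.exclude, na.fail). A minimal sketch, assuming a made-up data frame d with a response y and a factor g (neither name comes from the code in this message):

```r
## Hypothetical toy data: response y with one missing value, factor g.
d <- data.frame(y = c(1.2, NA, 3.1, 2.8, 3.9, 4.4),
                g = factor(rep(c("a", "b"), each = 3)))

## Exclusion, first type: rows with NA are dropped, and the residuals
## only cover the rows that were actually used.
fit.omit <- aov(y ~ g, data = d, na.action = na.omit)

## Exclusion, second type: rows with NA are dropped for the fit, but
## residuals() and fitted() are padded with NA to the original row count.
fit.excl <- aov(y ~ g, data = d, na.action = na.exclude)

## Failure: any NA in the model frame is an error.
fit.fail <- try(aov(y ~ g, data = d, na.action = na.fail), silent = TRUE)

n.omit <- length(residuals(fit.omit))
n.excl <- length(residuals(fit.excl))
failed <- inherits(fit.fail, "try-error")
```

None of these replaces a missing value; they only differ in how the incomplete rows are dropped or reported.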
In some situations, I personally like to have missing values replaced by
the mean (or the median) for the given combination of factors.
A routine that does that is something like the code included below. It
works, but is (of course) rather slow. It would be much quicker if sapply()
could be used -- and I imagine that somewhere in the "innards" of aov or lm
the data will have been broken up by factors such that sapply could be
applied. Is there a good (statistical or other) reason why there is no such
option? And alternatively, is there a more efficient solution than my code
below?
Thanks again,
--
RJV Bertin
NB: return address not valid; use r j v b e r t i n at h o t m a i l
dot c o m
df.Missing.Mean.VV1 <- function(df, verbose=FALSE)
{
    ## Replace missing values in the dataframe df by the mean of the
    ## corresponding column for each combination of the factors that
    ## interest us here.
    ## have to find a more elegant fashion to find the factor columns!
    Subjects <- length(levels(df$Snr))
    ## construct an array to receive the means for each combination of
    ## the relevant factors:
    Types <- length(levels(df$Type))
    sizes <- length(levels(df$size))
    Modalities <- length(levels(df$Modality))
    replval <- array(NA, dim=c(Subjects, Types, sizes, Modalities))
    ## integer level codes of the factors, used as array indices
    nSS <- as.numeric(df$Snr)
    nT <- as.numeric(df$Type)
    nS <- as.numeric(df$size)
    nM <- as.numeric(df$Modality)
    for( i in seq_len(ncol(df)) ){
        ## only numeric columns with at least one observed value
        if( is.numeric(df[,i]) && !all(is.na(df[,i])) ){
            ## loop variable is Ty rather than T, so that T (TRUE) is not masked
            for( Ty in seq_len(Types) ){
                for( S in seq_len(sizes) ){
                    for( M in seq_len(Modalities) ){
                        m <- mean( df[,i][ nT==Ty & nS==S & nM==M ],
                                   na.rm=TRUE )
                        ## subject-dependency should be redundant!
                        for( SS in seq_len(Subjects) ){
                            replval[SS,Ty,S,M] <- m
                        }
                    }
                }
            }
            for( j in seq_along(df[,i]) ){
                if( is.na(df[,i][j]) ){
                    SS <- nSS[j] ; Ty <- nT[j] ; S <- nS[j] ; M <- nM[j]
                    if( verbose ){
                        print( paste( "df[,", i, "][", j, "] == NA <- ",
                                      "mean(Snr=", df$Snr[j], ",T=", df$Type[j],
                                      ",S=", df$size[j], ",M=", df$Modality[j],
                                      ")==", replval[SS,Ty,S,M],
                                      sep="" ))
                    }
                    df[,i][j] <- replval[SS,Ty,S,M]
                }
            }
        }
    }
    df
}
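
One possible vectorised alternative to the loops above, offered as a sketch rather than a tested replacement: ave() from base R returns, for every row, a summary of the group that row belongs to, so the per-cell means can be computed without explicit loops over the factor levels. The helper name fill.cell.means and the toy data frame d are hypothetical; the factor names Type, size and Modality are taken from the function above.

```r
## Hypothetical helper: fill the NAs of every numeric column with the mean
## of the matching Type x size x Modality cell.
fill.cell.means <- function(df) {
    for (col in names(df)) {
        x <- df[[col]]
        if (!is.numeric(x)) next          # skips Snr and the other factors
        ## for each row, the mean of the non-missing values in that row's cell
        cell.mean <- ave(x, df$Type, df$size, df$Modality,
                         FUN = function(v) mean(v, na.rm = TRUE))
        miss <- is.na(x)
        x[miss] <- cell.mean[miss]
        df[[col]] <- x
    }
    df
}

## Made-up example data, only to show the call.
d <- data.frame(Snr      = factor(c(1, 2, 1, 2, 1, 2, 1, 2)),
                Type     = factor(rep(c("a", "b"), each = 4)),
                size     = factor(rep(c("s", "l"), each = 2, times = 2)),
                Modality = factor(rep("v", 8)),
                rt       = c(1, NA, 3, 5, NA, 7, 9, 11))
filled <- fill.cell.means(d)
```

A cell in which every value is missing has no mean to fall back on; in this sketch its entries become NaN rather than staying NA.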