[R] handling of missing values in aov/lm

Fri Jun 28 17:02:27 CEST 2002

R provides a few ways of handling missing values, among others in the 
context of an anova (aov): 2 types of exclusion (na.omit and na.exclude), 
and failure (na.fail).
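
For reference, a minimal sketch of those three behaviours on a made-up toy 
data frame (the names d, y and g are invented for the illustration). It uses 
lm() directly; aov() passes its na.action argument on to lm().

```r
## Toy data (made up for illustration): one missing response value
d <- data.frame(y = c(1.2, NA, 3.1, 4.0, 5.3, 6.1),
                g = factor(c("a", "a", "b", "b", "c", "c")))

## Exclusion, type 1: incomplete rows are dropped, so there are 5 residuals
fit.omit <- lm(y ~ g, data = d, na.action = na.omit)
length(residuals(fit.omit))

## Exclusion, type 2: rows are dropped for the fit as well, but residuals()
## and fitted() are padded with NA back to the original 6 rows
fit.excl <- lm(y ~ g, data = d, na.action = na.exclude)
length(residuals(fit.excl))

## Failure: na.fail stops with an error as soon as an NA is present
res.fail <- try(lm(y ~ g, data = d, na.action = na.fail), silent = TRUE)
inherits(res.fail, "try-error")
```

None of the three replaces a missing value by anything, which is what the 
rest of this message is about.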

In some situations, I personally like to have missing values replaced by 
the mean (or the median) for the given combination of factors.
A routine that does that is something like the code included below. It 
works, but is (of course) rather slow. It would be much quicker if sapply() 
could be used -- and I imagine that somewhere in the "innards" of aov or lm 
the data will have been broken up by factors such that sapply could be 
applied. Is there a good (statistical or other) reason why there is no such 
option? And alternatively, is there a more efficient solution than my code 
below?
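
To make the replacement rule concrete, here is a small sketch on made-up 
data (the data frame d, its columns and the helper fill.mean are invented 
for the illustration). It uses ave() from base R, which applies a function 
within each combination of the grouping factors; I make no claim that this 
is what aov or lm do internally.

```r
## Toy data (made up): two factors and one measurement with missing values
d <- data.frame(Type = factor(rep(c("p", "q"), each = 6)),
                size = factor(rep(rep(c("s", "l"), each = 3), 2)),
                y    = c(1, 3, NA,  4, NA, 6,  10, NA, 20,  NA, 7, 9))

## Hypothetical helper: replace the NAs in x by the mean of the
## non-missing values of x
fill.mean <- function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
}

## ave() calls fill.mean separately for each Type x size cell and puts
## the results back in the original row order
d$y.filled <- ave(d$y, d$Type, d$size, FUN = fill.mean)
d$y.filled
```

Using median() instead of mean() in the helper would give the median 
variant. A cell in which every value is missing stays missing (NaN).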

Thanks again,

-- 
RJV Bertin

NB: return address not valid; use  r j v b e r t i n  at  h o t m a i l  
dot  c o m

df.Missing.Mean.VV1 <- function(df,verbose=FALSE)
{ ## replace missing values in the dataframe df by the mean of the
  ## corresponding column for each combination of the factors that interest
  ## us here.
  ## have to find a more elegant fashion to find the factor columns!
     Subjects<-length(levels(df$Snr))
       ## construct an array to receive the means for each combination of
       ## the relevant factors:
     Types<-length(levels(df$Type))
     sizes<-length(levels(df$size))
     Modalities<-length(levels(df$Modality))
     replval<-rep(NA, Subjects*Types*sizes*Modalities)
     dim(replval)<-c(Subjects,Types,sizes,Modalities)
     nSS<-as.numeric(df$Snr)
     nT<-as.numeric(df$Type)
     nS<-as.numeric(df$size)
     nM<-as.numeric(df$Modality)
     for( i in 1:ncol(df) ){
          ## mean() of a non-numeric (e.g. factor) column is NA (with a
          ## warning), so such columns are skipped by the test below
          m<-mean(df[,i],na.rm=TRUE)
          if( !is.na(m) ){
               ## the loop indices are called iT, iS, iM and iSS so that
               ## they do not mask T, the shorthand for TRUE
               for( iT in 1:Types ){
                    for( iS in 1:sizes ){
                         for( iM in 1:Modalities ){
                              m <- mean( df[,i][ nT==iT & nS==iS & nM==iM ], na.rm=TRUE )
                                ## subject-dependency should be redundant!
                              for( iSS in 1:Subjects ){
                                   replval[iSS,iT,iS,iM] <- m
                              }
                         }
                    }
               }
               for( j in 1:length(df[,i]) ){
                    if( is.na(df[,i][j]) ){
                         iSS<-nSS[j] ; iT<-nT[j] ; iS<-nS[j] ; iM<-nM[j]
                         if( verbose ){
                              print( paste( "df[,", i, ",", j, "] == NA <-",
                                        "mean(Snr=", df$Snr[j], ",T=", df$Type[j],
                                        ",S=",df$size[j],",M=",df$Modality[j],")==",
                                        replval[iSS,iT,iS,iM],
                              sep="" ))
                         }
                         df[,i][j]<-replval[iSS,iT,iS,iM]
                    }
               }
          }
     }
     rm(nSS,nT,nS,nM,replval)
     df
}
