[R] replace Na values with the mean of the column which contains them

Mon Jul 29 20:40:35 CEST 2013

Replacements are a case where I think an explicit for-loop is better than sapply or any
other *apply function.  The for-loop will make the output resemble the input, while
sapply and friends will mangle the class, dimnames, and other attributes of the input.
Also, if you want to replace the NAs by the mean of the containing row then you have
to use t() on sapply's output.

E.g.
  > d <- cbind(AllNAs=NA, NoNAs=c(i=1,ii=2,iii=3,iv=4,v=5), SomeNAs=rep(c(100,NA),len=5))
  > f1 <- function(de)sapply(seq_len(ncol(de)),function(i) {de[,i][is.na(de[,i])]<-mean(de[,i],na.rm=TRUE);de[,i]})
  > f2 <- function(de) { for(i in seq_len(ncol(de))) de[is.na(de[,i]),i] <- mean(de[,i], na.rm=TRUE) ; de }
  > str(f1(d)) # no column names
   num [1:5, 1:3] NaN NaN NaN NaN NaN 1 2 3 4 5 ...
   - attr(*, "dimnames")=List of 2
    ..$ : chr [1:5] "i" "ii" "iii" "iv" ...
    ..$ : NULL
  > str(f2(d))
   num [1:5, 1:3] NaN NaN NaN NaN NaN 1 2 3 4 5 ...
   - attr(*, "dimnames")=List of 2
    ..$ : chr [1:5] "i" "ii" "iii" "iv" ...
    ..$ : chr [1:3] "AllNAs" "NoNAs" "SomeNAs"
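
For matrix input only, a loop-free variant that also keeps the dimnames seems possible.
This is not from the thread: f3 is a hypothetical name, and the approach (indexing
colMeans() by col()) is just a sketch of one way to give every NA cell the mean of its
own column.  It is not expected to work for the complex data.frame case, since
colMeans() requires numeric input.

```r
d <- cbind(AllNAs = NA, NoNAs = c(i = 1, ii = 2, iii = 3, iv = 4, v = 5),
           SomeNAs = rep(c(100, NA), length.out = 5))

# f3 is a hypothetical helper, not part of the original post; matrix input only
f3 <- function(m) {
  na <- is.na(m)
  # col(m)[na] is the column index of each NA cell, so every NA cell
  # receives the mean of its own column rather than a recycled value
  m[na] <- colMeans(m, na.rm = TRUE)[col(m)[na]]
  m
}
str(f3(d))
```

Because only the NA cells are assigned to, the matrix keeps its class and dimnames,
much as in the for-loop version.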

  > df <- data.frame(AllNAs=NA, NoNAs=c(i=1,ii=2,iii=3,iv=4,v=5), SomeNAs=rep(c(100+1i,NA),len=5))
  > str(f1(df)) # matrix of complex, not data.frame
   cplx [1:5, 1:3] NaN+0i NaN+0i NaN+0i ...
  > str(f2(df))
  'data.frame':   5 obs. of  3 variables:
   $ AllNAs : num  NaN NaN NaN NaN NaN
   $ NoNAs  : num  1 2 3 4 5
   $ SomeNAs: cplx  100+1i 100+1i 100+1i ...
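
The row-wise remark above can be sketched the same way.  g1 and g2 are hypothetical
names (row-wise analogues of f1 and f2, not part of the original post), and the sketch
assumes a numeric matrix like d; it is not meant to cover data.frame input.

```r
d <- cbind(AllNAs = NA, NoNAs = c(i = 1, ii = 2, iii = 3, iv = 4, v = 5),
           SomeNAs = rep(c(100, NA), length.out = 5))

# g1 (hypothetical): sapply over rows returns one COLUMN per row,
# so the result has to be transposed with t()
g1 <- function(de) t(sapply(seq_len(nrow(de)), function(i) {
  de[i, ][is.na(de[i, ])] <- mean(de[i, ], na.rm = TRUE)
  de[i, ]
}))

# g2 (hypothetical): the for-loop assigns into de itself,
# so its attributes are left alone
g2 <- function(de) {
  for (i in seq_len(nrow(de))) de[i, is.na(de[i, ])] <- mean(de[i, ], na.rm = TRUE)
  de
}

str(g1(d))
str(g2(d))
```

Here the sapply version should keep the column names but drop the row names, the mirror
image of the column-wise case, while the for-loop version keeps both.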

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of arun
> Sent: Monday, July 29, 2013 10:58 AM
> To: iza.ch1
> Cc: R help
> Subject: Re: [R] replace Na values with the mean of the column which contains them
> 
> Hi,
> 
> de<- structure(c(NA, NA, NA, NA, NA, NA, NA, NA, 0.27500571, -3.07568579,
> -0.42240954, -0.26901731, 0.01766284, -0.8099958, 0.20805934,
> 0.03036708, -0.26928087, 1.20925752, 0.38012008, -0.41778861,
> -0.49677462, -0.13248754, -0.54179054, 0.35788624, -0.41467591,
> -0.59234248, 0.73642396, -0.06768044, -0.40321968, -1.52283305,
> 0.25974308, -0.0401373, -0.1192078, 0.9325334, -1.8927164, 1.4330507,
> 0.2892706, 1.3976522, 0.2295291, -0.5009389, -0.342656, -0.8439027,
> -0.4971999, -1.6127122, -0.6508823, 1.4729576, -1.6093478, 0.1686006
> ), .Dim = c(16L, 3L))
> 
> 
> Your code should be:
> sapply(seq_len(ncol(de)),function(i) {de[,i][is.na(de[,i])]<-
> mean(de[,i],na.rm=TRUE);de[,i]})
> A.K.
> 
> 
> 
> 
> Hi everyone
> 
> I have a problem with replacing the NA values with the mean of
> the column which contains them. If I replace each NA with the mean of the
> remaining values in its column, the mean of the whole column stays
> the same as if I had omitted the NA values. I have the following data
> 
> de
>      [,1]        [,2]       [,3]
>  [1,]          NA -0.26928087 -0.1192078
>  [2,]          NA  1.20925752  0.9325334
>  [3,]          NA  0.38012008 -1.8927164
>  [4,]          NA -0.41778861  1.4330507
>  [5,]          NA -0.49677462  0.2892706
>  [6,]          NA -0.13248754  1.3976522
>  [7,]          NA -0.54179054  0.2295291
>  [8,]          NA  0.35788624 -0.5009389
>  [9,]  0.27500571 -0.41467591 -0.3426560
> [10,] -3.07568579 -0.59234248 -0.8439027
> [11,] -0.42240954  0.73642396 -0.4971999
> [12,] -0.26901731 -0.06768044 -1.6127122
> [13,]  0.01766284 -0.40321968 -0.6508823
> [14,] -0.80999580 -1.52283305  1.4729576
> [15,]  0.20805934  0.25974308 -1.6093478
> [16,]  0.03036708 -0.04013730  0.1686006
> 
> and I wrote the code
> de[which(is.na(de))]<-sapply(seq_len(ncol(de)),function(i) {mean(de[,i],na.rm=TRUE)})
> 
> I get as the result
>    [,1]        [,2]       [,3]
>  [1,] -0.50575168 -0.26928087 -0.1192078
>  [2,] -0.12222376  1.20925752  0.9325334
>  [3,] -0.13412312  0.38012008 -1.8927164
>  [4,] -0.50575168 -0.41778861  1.4330507
>  [5,] -0.12222376 -0.49677462  0.2892706
>  [6,] -0.13412312 -0.13248754  1.3976522
>  [7,] -0.50575168 -0.54179054  0.2295291
>  [8,] -0.12222376  0.35788624 -0.5009389
>  [9,]  0.27500571 -0.41467591 -0.3426560
> [10,] -3.07568579 -0.59234248 -0.8439027
> [11,] -0.42240954  0.73642396 -0.4971999
> [12,] -0.26901731 -0.06768044 -1.6127122
> [13,]  0.01766284 -0.40321968 -0.6508823
> [14,] -0.80999580 -1.52283305  1.4729576
> [15,]  0.20805934  0.25974308 -1.6093478
> [16,]  0.03036708 -0.04013730  0.1686006
> 
> It has replaced the first NA in the first column with the mean of the first
> column (-0.505...), the second NA with the mean of the second column, etc.
> I want to have the result like this:
> [,1]        [,2]       [,3]
>  [1,] -0.50575168 -0.26928087 -0.1192078
>  [2,] -0.50575168  1.20925752  0.9325334
>  [3,] -0.50575168  0.38012008 -1.8927164
>  [4,] -0.50575168 -0.41778861  1.4330507
>  [5,] -0.50575168 -0.49677462  0.2892706
>  [6,] -0.50575168 -0.13248754  1.3976522
>  [7,] -0.50575168 -0.54179054  0.2295291
>  [8,] -0.50575168  0.35788624 -0.5009389
>  [9,]  0.27500571 -0.41467591 -0.3426560
> [10,] -3.07568579 -0.59234248 -0.8439027
> [11,] -0.42240954  0.73642396 -0.4971999
> [12,] -0.26901731 -0.06768044 -1.6127122
> [13,]  0.01766284 -0.40321968 -0.6508823
> [14,] -0.80999580 -1.52283305  1.4729576
> [15,]  0.20805934  0.25974308 -1.6093478
> [16,]  0.03036708 -0.04013730  0.1686006
> 
> Thanks in advance
> 