[R] replacing all NA's in a dataframe with zeros...

Thu Mar 15 09:08:42 CET 2007

On Wed, 2007-03-14 at 20:16 -0700, Steven McKinney wrote:
> Since you can index a matrix or dataframe with
> a matrix of logicals, you can use is.na()
> to index all the NA locations and replace them
> all with 0 in one command.
> 

A quicker solution which, IIRC, was posted to the list by Peter
Dalgaard several years ago, is:

sapply(mydata.df, function(x) {x[is.na(x)] <- 0; x})
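
One thing to watch with the per-column approach: sapply() simplifies its
result to a matrix rather than returning a data frame. A minimal sketch of
one way to keep the data frame, assuming a small made-up toy.df (the names
toy.df, as.mat and filled.df are only for illustration):

```r
# Toy data frame with some missing values (hypothetical example data)
toy.df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# sapply() simplifies the per-column results to a matrix
as.mat <- sapply(toy.df, function(x) {x[is.na(x)] <- 0; x})
is.matrix(as.mat)

# Assigning into filled.df[] replaces the columns one by one but keeps
# the data frame structure and the column names
filled.df <- toy.df
filled.df[] <- lapply(toy.df, function(x) {x[is.na(x)] <- 0; x})
is.data.frame(filled.df)
```

If all columns are numeric the matrix may be all you need; with mixed
column types the lapply() form is probably the safer choice, since a
matrix can only hold one type.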

Some timings on a larger problem with 100 columns:

> mydata.df <- as.data.frame(matrix(sample(c(as.numeric(NA), 1), 
                             size = 1000*100, replace = TRUE), 
                             nrow = 1000))

> system.time(retval <- sapply(mydata.df, 
                               function(x) {x[is.na(x)] <- 0; x}))
[1] 0.108 0.008 0.120 0.000 0.000

> system.time(mydata.df[is.na(mydata.df)] <- 0)
[1] 2.460 0.028 2.498 0.000 0.000

And a larger problem still, with 1000 columns:

> mydata.df <- as.data.frame(matrix(sample(c(as.numeric(NA), 1), 
                             size = 1000*1000, replace = TRUE), 
                             nrow = 1000))

> system.time(retval <- sapply(mydata.df, 
                               function(x) {x[is.na(x)] <- 0; x}))
[1] 0.908 0.068 2.657 0.000 0.000
> system.time(mydata.df[is.na(mydata.df)] <- 0)
[1] 43.127  0.332 46.440  0.000  0.000

Profiling mydata.df[is.na(mydata.df)] <- 0 shows that it spends most of
its time subsetting the individual cells of the data frame in turn
and setting the NA ones to 0.
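
For anyone who wants to repeat the profiling, a rough sketch using Rprof()
and summaryRprof() is below. The repeat count, sampling interval and the
names prof.df, prof.file and tmp are my own choices for illustration, and
the functions that show up at the top of the summary will depend on your
version of R:

```r
# Smaller test data frame, built the same way as in the timings above
prof.df <- as.data.frame(matrix(sample(c(as.numeric(NA), 1),
                                size = 1000*100, replace = TRUE),
                                nrow = 1000))

# Write profiling samples to a temporary file
prof.file <- tempfile(fileext = ".out")
Rprof(prof.file, interval = 0.01)

# Repeat the replacement on a fresh copy so the profiler collects
# enough samples to summarise
for (i in 1:200) {
    tmp <- prof.df
    tmp[is.na(tmp)] <- 0
}

Rprof(NULL)   # stop profiling

# Functions ranked by total time spent in them and their callees
head(summaryRprof(prof.file)$by.total)
```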

HTH

G

> > mydata.df <- as.data.frame(matrix(sample(c(as.numeric(NA), 1), size = 30, replace = TRUE), nrow = 6))
> > mydata.df
>   V1 V2 V3 V4 V5
> 1  1 NA  1  1  1
> 2  1 NA NA NA  1
> 3 NA NA  1 NA NA
> 4 NA NA NA NA  1
> 5 NA  1 NA NA  1
> 6  1 NA NA  1  1
> > is.na(mydata.df)
>      V1    V2    V3    V4    V5
> 1 FALSE  TRUE FALSE FALSE FALSE
> 2 FALSE  TRUE  TRUE  TRUE FALSE
> 3  TRUE  TRUE FALSE  TRUE  TRUE
> 4  TRUE  TRUE  TRUE  TRUE FALSE
> 5  TRUE FALSE  TRUE  TRUE FALSE
> 6 FALSE  TRUE  TRUE FALSE FALSE
> > mydata.df[is.na(mydata.df)] <- 0
> > mydata.df
>   V1 V2 V3 V4 V5
> 1  1  0  1  1  1
> 2  1  0  0  0  1
> 3  0  0  1  0  0
> 4  0  0  0  0  1
> 5  0  1  0  0  1
> 6  1  0  0  1  1
> > 
> 
> Steven McKinney
> 
> Statistician
> Molecular Oncology and Breast Cancer Program
> British Columbia Cancer Research Centre
> 
> email: smckinney at bccrc.ca
> 
> tel: 604-675-8000 x7561
> 
> BCCRC
> Molecular Oncology
> 675 West 10th Ave, Floor 4
> Vancouver B.C. 
> V5Z 1L3
> Canada
> 
> 
> 
> 
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch on behalf of David L. Van Brunt, Ph.D.
> Sent: Wed 3/14/2007 5:22 PM
> To: R-Help List
> Subject: [R] replacing all NA's in a dataframe with zeros...
>  
> I've seen how to replace the NA's in a single column within a data frame:
> 
> > mydata$ncigs[is.na(mydata$ncigs)] <- 0
> 
> But this is just one column... I have thousands of columns (!) that I need
> to do this for, and I can't figure out a way, outside of the dreaded loop, to
> replace all NA's in an entire data frame (all vars) without naming each var
> separately. Yikes.
> 
> I'm racking my brain on this, seems like I must be staring at the obvious,
> but it eludes me. Searches have come up CLOSE, but not quite what I need..
> 
> Any pointers?
> 
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                     [t] +44 (0)20 7679 0522
ECRC                              [f] +44 (0)20 7679 0565
UCL Department of Geography
Pearson Building                  [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street
London, UK                        [w] http://www.ucl.ac.uk/~ucfagls/
WC1E 6BT                          [w] http://www.freshwaters.org.uk/
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%