[R] replacing all NA's in a dataframe with zeros...

Joerg van den Hoff j.van_den_hoff at fzd.de
Thu Mar 15 11:30:20 CET 2007


On Thu, Mar 15, 2007 at 10:21:22AM +0100, Peter Dalgaard wrote:
> Gavin Simpson wrote:
> > On Wed, 2007-03-14 at 20:16 -0700, Steven McKinney wrote:
> >   
> >> Since you can index a matrix or dataframe with
> >> a matrix of logicals, you can use is.na()
> >> to index all the NA locations and replace them
> >> all with 0 in one command.
> >>
> >>     
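> >> e.g., a minimal sketch (assuming a data frame test.df in
> >> which some entries are NA):
> >>
> >>   test.df[is.na(test.df)] <- 0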
> >
> > A quicker solution that, IIRC, was posted to the list by Peter
> > Dalgaard several years ago is:
> >
> > sapply(mydata.df, function(x) {x[is.na(x)] <- 0; x})
> >   
> I hope your memory fails you, because it doesn't actually work.....
> 
> > sapply(test.df, function(x) {x[is.na(x)] <- 0; x})
>      x1 x2 x3
> [1,]  0  1  1
> [2,]  2  2  0
> [3,]  3  3  0
> [4,]  0  4  4
> 
> is a matrix, not a data frame.
> 
> Instead:
> 
> > test.df[] <- lapply(test.df, function(x) {x[is.na(x)] <- 0; x})
> > test.df
>   x1 x2 x3
> 1  0  1  1
> 2  2  2  0
> 3  3  3  0
> 4  0  4  4
> 
> Speedwise, sapply() is doing lapply() internally, and the assignment
> overhead should be small, so I'd expect similar timings.
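> 
> a rough timing sketch (enlarging a hypothetical test.df so the
> difference becomes measurable; absolute numbers will of course vary):
> 
> big.df <- test.df[rep(1:4, 25000), ]    # blow up to 100000 rows
> system.time(out <- sapply(big.df, function(x) {x[is.na(x)] <- 0; x}))
> system.time(big.df[] <- lapply(big.df, function(x) {x[is.na(x)] <- 0; x}))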

just an idea:
given the order-of-magnitude difference (a factor of 17 or so) in runtime
between the "obvious" solution and the fast one: wouldn't it be
possible/sensible to modify the corresponding subsetting method
("[.data.frame") so that it recognizes the case where it is called with an
arbitrary index matrix (the problem is not restricted to indexing with a
logical matrix, I presume?) and switches internally to the fast solution
given above?
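
for illustration, a small timing sketch of the difference I mean (df1/df2
are made-up example data; the exact factor will of course vary between
machines and data sizes):

    ## two identical copies of a data frame with NAs sprinkled in
    df1 <- df2 <- data.frame(matrix(sample(c(1:9, NA), 1e6,
                                           replace = TRUE), ncol = 10))

    ## the "obvious" solution: subassignment via a logical index matrix
    system.time(df1[is.na(df1)] <- 0)

    ## the fast column-wise solution quoted above
    system.time(df2[] <- lapply(df2, function(x) {x[is.na(x)] <- 0; x}))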

in my (admittedly limited) experience, one of the not so nice properties of
R is that one encounters exactly the above situation in quite a few cases:
unexpected massive differences in runtime between different solutions (I'm
not talking about the "explicit loop penalty"). what concerns me most are
the very basic scenarios (not complex algorithms): data frames vs. matrices,
naming vector components or not, subsetting, read.table vs. scan, etc. if
there were a concise HOWTO list for the cases "when speed matters", that
would be helpful, too.
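
as an example of the kind of entry such a list could contain (made-up
data; only the relative difference matters):

    m <- matrix(rnorm(1e6), ncol = 10)
    d <- as.data.frame(m)
    system.time(for (i in 1:1000) m[i, ])  # row access on a matrix
    system.time(for (i in 1:1000) d[i, ])  # same rows from a data frame: far slower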

I understand that part of the "uneven performance" is unavoidable and one
must expect the user to go to the trouble of understanding the reasons,
e.g. for the differences between handling purely numerical data in matrices
vs. data frames. but a factor of 17 between the obvious approach and the
wise one seems a trap into which 99% of people will step (probably never
suspecting that there might be a faster approach).

joerg


