[R] Replace NAs in one column with data from another column
Joshua Wiley
jwiley.psych at gmail.com
Wed Sep 8 21:56:57 CEST 2010
On Wed, Sep 8, 2010 at 12:02 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Sep 8, 2010, at 2:24 PM, Joshua Wiley wrote:
>
>> Hi Jakob,
>>
>> You can use is.na() to create an index of which rows in column 3 are
>> missing data, and then select these from column 1. Here is a simple
>> example:
>>
>> dat <- data.frame(V1 = 1:5, V3 = c(1, NA, 3, 4, NA))
>> dat$new <- dat$V3
>> my.na <- is.na(dat$V3)
>> dat$new[my.na] <- dat$V1[my.na]
>>
>> dat
>>
>> This should be quite fast. I broke the steps up to be explicit, but
>> you can readily simplify them.
>
> I was about to post something similar except I was going to avoid the "$"
> operator thinking, incorrectly as it turned out, that it would be faster. I
> also include the Holtman/Rizopoulos suggestion of ifelse(). I was also
> surprised that ifelse is the winning strategy:
That surprises me too. What I find really curious is the (relatively)
large difference between the dlr.sign and index methods. Some of that
difference is recovered if dat[, 4] <- dat[, 3] is used instead of
dat[4] <- dat[3] (the inventively named index2 below), but on my old
clunker it still lags noticeably behind dlr.sign:
# after failed attempts with benchmark::benchmark()
# I decided this is what you used
> library(rbenchmark)
> dat <- data.frame(V1 = 1:5, V3 = c(1, NA, 3, 4, NA))
> rbenchmark::benchmark(meth.ifelse = {dat$z.new <- ifelse(is.na(dat$V3), dat$V1, dat$V3)},
+ meth.dlr.sign = {dat$new <- dat$V3
+ my.na <- is.na(dat$V3)
+ dat$new[my.na] <- dat$V1[my.na]},
+ meth.index = {dat[4] <- dat[3]; idx <-is.na(dat[, 3])
+ dat[idx, 4] <- dat[idx, 1]},
+ meth.index2 = {dat[, 4] <- dat[, 3]; idx <-is.na(dat[, 3])
+ dat[idx, 4] <- dat[idx, 1]},
+ meth.forloop = {for (i in 1:nrow(dat)){
+ if(is.na(dat[i,2])==TRUE){
+ dat[i, 3] <- dat[i, 1]
+ } else { dat[i,3] <- dat[i,2]}}
+ },
+ replications=5000, columns = c("test", "replications", "elapsed",
+ "relative", "user.self"))
           test replications elapsed  relative user.self
2 meth.dlr.sign         5000   1.337  1.206679     1.216
5  meth.forloop         5000  16.941 15.289711    14.997
1   meth.ifelse         5000   1.108  1.000000     1.061
3    meth.index         5000   8.868  8.003610     7.164
4   meth.index2         5000   6.099  5.504513     5.136
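
For what it is worth, here is a minimal sketch of how the two fast
approaches might look on data shaped like Jakob's. It assumes columns 1
and 3 are factors (his as.character() calls suggest so, but I have not
seen the data). The data frame m below is a made-up stand-in with only
three columns; its column names, its values, and the name of the new
column are assumptions for illustration.

```r
# Made-up stand-in for Jakob's data frame (assumption: factor columns).
set.seed(1)
n <- 169221
m <- data.frame(V1 = factor(sample(letters, n, replace = TRUE)),
                V2 = seq_len(n),
                V3 = factor(sample(c(LETTERS, NA), n, replace = TRUE)))

# Convert once up front: ifelse() drops factor attributes, so calling it
# directly on factors would hand back integer codes instead of labels.
v1 <- as.character(m[[1]])
v3 <- as.character(m[[3]])

# Approach 1: ifelse()
new.ifelse <- ifelse(is.na(v3), v1, v3)

# Approach 2: copy column 3, then overwrite only the NA positions
new.index <- v3
my.na <- is.na(v3)
new.index[my.na] <- v1[my.na]

# Both approaches should agree, and no NAs should remain
stopifnot(identical(new.ifelse, new.index), !any(is.na(new.index)))

# "new" is a hypothetical name for the extra column
m$new <- new.index
```

Either version is a handful of vectorised calls on the whole column, so
it should not need a row-by-row loop even at this size.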
>
> dat[4] <- dat[3]; idx <-is.na(dat[, 3])
> dat[is.na(dat[, 3]), 4] <- dat[is.na(dat[, 3]), 1]
>
> > benchmark(meth.ifelse = {dat$z.new <- ifelse(is.na(dat$V3), dat$V1, dat$V3)},
> + meth.dlr.sign={dat$new <- dat$V3
> + my.na <- is.na(dat$V3)
> + dat$new[my.na] <- dat$V1[my.na]},
> + meth.index ={dat[4] <- dat[3]; idx <-is.na(dat[, 3])
> + dat[idx, 4] <- dat[idx, 1]},
> + meth.forloop ={for (i in 1:nrow(dat)){
> + if (is.na(dat[i,3])==TRUE){
> + dat[i,4]<- dat[i,1]}
> + else{
> + dat[i,4]<- dat[i,3]} }
> + },
> + replications=5000, columns = c("test", "replications", "elapsed",
> + "relative", "user.self") )
>            test replications elapsed  relative user.self
> 2 meth.dlr.sign         5000   0.502  1.081897     0.501
> 4  meth.forloop         5000   6.419 13.834052     6.409
> 1   meth.ifelse         5000   0.464  1.000000     0.463
> 3    meth.index         5000   2.908  6.267241     2.904
>
> --
> David.
>>
>> HTH,
>>
>> Josh
>>
>> On Wed, Sep 8, 2010 at 11:17 AM, Jakob Hedegaard
>> <Jakob.Hedegaard at agrsci.dk> wrote:
>>>
>>> Hi list,
>>>
>>> I have a data frame (m) with 169221 rows and 10 columns and would like to
>>> make a new column containing the content of column 3 but replace the NAs in
>>> column 3 with the data in column 1 (from the same row as the NA in column
>>> 3). Column 1 has data in all rows.
>>>
>>> My first attempt was:
>>>
>>> for (i in 1:169221){
>>> if (is.na(m[i,3])==TRUE){
>>> m[i,11] <- as.character(m[i,1])}
>>> else{
>>> m[i,11] <- as.character(m[i,3])}
>>> }
>>>
>>> This works, but it takes too long.
>>> I would appreciate alternative solutions.
>>>
>>> Best regards, Jakob
>>
> --
>
> David Winsemius, MD
> West Hartford, CT
>
>
--
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/