[R] Doubt in simple merge

Fri Jan 17 14:16:10 CET 2014

On Jan 16, 2014, at 11:14 PM, kingsly <ecokingsly at yahoo.co.in> wrote:

> Thank you dear friends.  You have cleared my first doubt.  
> 
> My second doubt:
> I have the same data sets "Elder" and "Younger".
> Elder <- data.frame(
>   ID=c("ID1","ID2","ID3"),
>   age=c(38,35,31))
> Younger <- data.frame(
>   ID=c("ID4","ID5","ID3"),
>   age=c(29,21,"NA"))
> 
> 
> Row ID3 appears in both data sets. It has a value (31) in "Elder" but "NA" in "Younger".
> 
> I need output like this.
> 
> ID    age
> ID1  38
> ID2  35
> ID3  31
> ID4  29
> ID5  21 
> 
> Kindly help me.

First, there is a problem with the way in which you created Younger: you entered the NA as "NA", which is a character string. c() therefore coerces all three values to character, and data.frame() then converts the entire column to a factor, rather than a numeric:

> str(Younger)
'data.frame':	3 obs. of  2 variables:
 $ ID : Factor w/ 3 levels "ID3","ID4","ID5": 2 3 1
 $ age: Factor w/ 3 levels "21","29","NA": 2 1 3

It then causes problems in the default merge():

DF <- merge(Elder, Younger, by = c("ID", "age"), all = TRUE)

> str(DF)
'data.frame':	6 obs. of  2 variables:
 $ ID : Factor w/ 5 levels "ID1","ID2","ID3",..: 1 2 3 3 4 5
 $ age: chr  "38" "35" "31" "NA" ...

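Note that 'age' is now a character vector, again rather than numeric, and that the "NA" in it is an ordinary string, not a missing value.

Thus, recreate Younger with a real (unquoted) NA: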

Younger <- data.frame(ID = c("ID4", "ID5", "ID3"), age = c(29, 21, NA))
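
If a data frame like your original Younger, with "NA" stored as a string, already exists (say, it was read from a file) and cannot easily be recreated, the column can also be repaired in place. This is only a minimal sketch, assuming the string "NA" is the only non-numeric entry; 'age_chr' is just a temporary name used for illustration:

```r
# Recreate the problematic data frame from the original post
Younger <- data.frame(ID  = c("ID4", "ID5", "ID3"),
                      age = c(29, 21, "NA"))

# Convert via as.character() first: calling as.numeric() directly on a
# factor returns the internal integer codes, not the displayed values
age_chr <- as.character(Younger$age)

# Turn the string "NA" into a real missing value
age_chr[age_chr == "NA"] <- NA

# Now the conversion to numeric is safe
Younger$age <- as.numeric(age_chr)

str(Younger$age)
```

The as.character() step is harmless if the column is already character rather than factor, so the same lines work in either case.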

Now, when you merge as before, you get:

> str(merge(Elder, Younger, by = c("ID", "age"), all = TRUE))
'data.frame':	6 obs. of  2 variables:
 $ ID : Factor w/ 5 levels "ID1","ID2","ID3",..: 1 2 3 3 4 5
 $ age: num  38 35 31 NA 29 21

> merge(Elder, Younger, by = c("ID", "age"), all = TRUE)
   ID age
1 ID1  38
2 ID2  35
3 ID3  31
4 ID3  NA
5 ID4  29
6 ID5  21

Presuming that you want to consistently remove any rows with NA values, whichever data frame they come from:

> na.omit(merge(Elder, Younger, by = c("ID", "age"), all = TRUE))
   ID age
1 ID1  38
2 ID2  35
3 ID3  31
5 ID4  29
6 ID5  21

See ?na.omit
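
Keep in mind that na.omit() drops a row if any of its columns is NA. That makes no difference here, since 'age' is the only column that can be missing, but if your real data have further columns you may prefer to filter on 'age' alone. A small sketch of that alternative, using the corrected data from above ('Result' is just an illustrative name):

```r
Elder   <- data.frame(ID = c("ID1", "ID2", "ID3"), age = c(38, 35, 31))
Younger <- data.frame(ID = c("ID4", "ID5", "ID3"), age = c(29, 21, NA))

DF <- merge(Elder, Younger, by = c("ID", "age"), all = TRUE)

# Keep only the rows where 'age' is not missing; an NA in any other
# column (there are none here) would not cause a row to be dropped
Result <- DF[!is.na(DF$age), ]
Result
```

Note that with either approach, an ID whose age is NA in both data frames would disappear from the result entirely.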

Regards,

Marc Schwartz