[R] Unexpected behavior using `merge' by multiple columns

Eric DeWitt edewitt at hampshire.edu
Thu May 16 01:58:31 CEST 2013


I recently started using `merge()' to combine data frames that contained
different conditioned subsets of the same master data frame. In some conditions
the `by' columns had `NA' values and I was using `incomparables=c(NA, NaN)' to
avoid including these rows in the resulting merge. However, the behavior never
appeared to match the documentation. I continued to received partial matching
on columns that included `NA's. I attempted to understand the expected behavior
by looking at the example from the documentation and it appears to me to have
the same problem. Am I misinterpreting the documentation or is the behavior
inconsistent? Below is a reproducible example from the `merge' documentation.
The modified version run at the end uses the fix suggested in the example.

Best,
Eric

### base::merge potentially unexpected behavior

# The `merge' argument `incomparables' does not appear to behave in the manner
# described in the documentation. Including `NA' in comparables does not prevent
# NA matching.

# example from the documentation:

## example of using `incomparables'
x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)
y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)
merge(x, y, by = c("k1","k2")) # NA's match
  k1 k2 data.x data.y
1  4  4      4      4
2  5  5      5      5
3 NA NA      2      1
merge(x, y, by = c("k1","k2"), incomparables = NA)
  k1 k2 data.x data.y
1  4  4      4      4
2  5  5      5      5
3 NA NA      2      1
merge(x, y, by = "k1") # NA's match, so 6 rows
  k1 k2.x data.x k2.y data.y
1  4    4      4    4      4
2  5    5      5    5      5
3 NA    1      1   NA      1
4 NA    1      1    3      3
5 NA   NA      2   NA      1
6 NA   NA      2    3      3
merge(x, y, by = "k2", incomparables = NA) # 2 rows
  k2 k1.x data.x k1.y data.y
1  4    4      4    4      4
2  5    5      5    5      5

# Observing that when merging on two columns the `incomparables' appears to have
# no effect, it appears that the problem is located in the creation of the
# common index into the two data frames. debugging merge reveals that the
# problem is in the construction via paste which converts `NA' to "NA":
#
# debug: bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
# Browse[2]bz
#	[1] "NA\r1"	"NA\rNA" "3\rNA"	"4\r4"	 "5\r5"	 "NA\rNA" "2\rNA"	"NA\r3" 
#	[9] "4\r4"	 "5\r5"	
# debug: bx <- bz[seq_len(nx)]
# debug: by <- bz[nx + seq_len(ny)]
# Browse[2]bx
# [1] "NA\r1"  "NA\rNA" "3\rNA"  "4\r4"   "5\r5"  
# Browse[2]by
# [1] "NA\rNA" "2\rNA"  "NA\r3"  "4\r4"   "5\r5"  
# 
# which produces "NA\rNA" matches
#
# debug: comm <- match(bx, by, 0L)
# Browse[2]comm
# [1] 0 1 0 4 5
# 
# Given that this appears to the be intended behavior of `paste', the solution
# appears to require that the elements in `incomparables' be removed from the
# rows after to the paste operation. The following is an example (perhaps
# inefficient) that would solve the problem:
# 
# bx <- x[, by.x, drop = FALSE]
# by <- y[, by.y, drop = FALSE]
# names(bx) <- names(by) <- paste0("V", seq_len(ncol(bx)))
# bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
# bx <- bz[seq_len(nx)]
# by <- bz[nx + seq_len(ny)]
# bx[apply(is.na(x),1,any)] <- NA
# by[apply(is.na(y),1,any)] <- NA
# comm <- match(bx, by, 0L, incomparables)
#
# The resulting patched merge produces:
source("merge.fixed.R")
merge.fixed(x, y, by = c("k1","k2")) # NA's match
   k1 k2 data.x data.y
1   4  4      4      4
2   5  5      5      5
3  NA  1      1      1
4  NA  1      1      2
5  NA  1      1      3
6  NA NA      2      1
7  NA NA      2      2
8  NA NA      2      3
9   3 NA      3      1
10  3 NA      3      2
11  3 NA      3      3
merge.fixed(x, y, by = c("k1","k2"), incomparables = NA)
  k1 k2 data.x data.y
1  4  4      4      4
2  5  5      5      5
merge.fixed(x, y, by = "k1") # NA's match, so 6 rows
  k1 k2.x data.x k2.y data.y
1  4    4      4    4      4
2  5    5      5    5      5
3 NA    1      1   NA      1
4 NA    1      1    3      3
5 NA   NA      2   NA      1
6 NA   NA      2    3      3
merge.fixed(x, y, by = "k2", incomparables = NA) # 2 rows
  k2 k1.x data.x k1.y data.y
1  4    4      4    4      4
2  5    5      5    5      5

# This appears to match the documented behavior.

# R version
version
               _                           
platform       x86_64-apple-darwin10.8.0   
arch           x86_64                      
os             darwin10.8.0                
system         x86_64, darwin10.8.0        
status                                     
major          3                           
minor          0.0                         
year           2013                        
month          04                          
day            03                          
svn rev        62481                       
language       R                           
version.string R version 3.0.0 (2013-04-03)
nickname       Masked Marvel               
sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base



More information about the R-help mailing list