[R] Unexpected behavior using `merge' by multiple columns
Eric DeWitt
edewitt at hampshire.edu
Thu May 16 01:58:31 CEST 2013
I recently started using `merge()' to combine data frames that contained
different conditioned subsets of the same master data frame. In some conditions
the `by' columns had `NA' values and I was using `incomparables=c(NA, NaN)' to
avoid including these rows in the resulting merge. However, the behavior never
appeared to match the documentation. I continued to received partial matching
on columns that included `NA's. I attempted to understand the expected behavior
by looking at the example from the documentation and it appears to me to have
the same problem. Am I misinterpreting the documentation or is the behavior
inconsistent? Below is a reproducible example from the `merge' documentation.
The modified version run at the end uses the fix suggested in the example.
Best,
Eric
### base::merge potentially unexpected behavior
# The `merge' argument `incomparables' does not appear to behave in the manner
# described in the documentation. Including `NA' in comparables does not prevent
# NA matching.
# example from the documentation:
## example of using `incomparables'
x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)
y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)
merge(x, y, by = c("k1","k2")) # NA's match
k1 k2 data.x data.y
1 4 4 4 4
2 5 5 5 5
3 NA NA 2 1
merge(x, y, by = c("k1","k2"), incomparables = NA)
k1 k2 data.x data.y
1 4 4 4 4
2 5 5 5 5
3 NA NA 2 1
merge(x, y, by = "k1") # NA's match, so 6 rows
k1 k2.x data.x k2.y data.y
1 4 4 4 4 4
2 5 5 5 5 5
3 NA 1 1 NA 1
4 NA 1 1 3 3
5 NA NA 2 NA 1
6 NA NA 2 3 3
merge(x, y, by = "k2", incomparables = NA) # 2 rows
k2 k1.x data.x k1.y data.y
1 4 4 4 4 4
2 5 5 5 5 5
# Observing that when merging on two columns the `incomparables' appears to have
# no effect, it appears that the problem is located in the creation of the
# common index into the two data frames. debugging merge reveals that the
# problem is in the construction via paste which converts `NA' to "NA":
#
# debug: bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
# Browse[2]bz
# [1] "NA\r1" "NA\rNA" "3\rNA" "4\r4" "5\r5" "NA\rNA" "2\rNA" "NA\r3"
# [9] "4\r4" "5\r5"
# debug: bx <- bz[seq_len(nx)]
# debug: by <- bz[nx + seq_len(ny)]
# Browse[2]bx
# [1] "NA\r1" "NA\rNA" "3\rNA" "4\r4" "5\r5"
# Browse[2]by
# [1] "NA\rNA" "2\rNA" "NA\r3" "4\r4" "5\r5"
#
# which produces "NA\rNA" matches
#
# debug: comm <- match(bx, by, 0L)
# Browse[2]comm
# [1] 0 1 0 4 5
#
# Given that this appears to the be intended behavior of `paste', the solution
# appears to require that the elements in `incomparables' be removed from the
# rows after to the paste operation. The following is an example (perhaps
# inefficient) that would solve the problem:
#
# bx <- x[, by.x, drop = FALSE]
# by <- y[, by.y, drop = FALSE]
# names(bx) <- names(by) <- paste0("V", seq_len(ncol(bx)))
# bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
# bx <- bz[seq_len(nx)]
# by <- bz[nx + seq_len(ny)]
# bx[apply(is.na(x),1,any)] <- NA
# by[apply(is.na(y),1,any)] <- NA
# comm <- match(bx, by, 0L, incomparables)
#
# The resulting patched merge produces:
source("merge.fixed.R")
merge.fixed(x, y, by = c("k1","k2")) # NA's match
k1 k2 data.x data.y
1 4 4 4 4
2 5 5 5 5
3 NA 1 1 1
4 NA 1 1 2
5 NA 1 1 3
6 NA NA 2 1
7 NA NA 2 2
8 NA NA 2 3
9 3 NA 3 1
10 3 NA 3 2
11 3 NA 3 3
merge.fixed(x, y, by = c("k1","k2"), incomparables = NA)
k1 k2 data.x data.y
1 4 4 4 4
2 5 5 5 5
merge.fixed(x, y, by = "k1") # NA's match, so 6 rows
k1 k2.x data.x k2.y data.y
1 4 4 4 4 4
2 5 5 5 5 5
3 NA 1 1 NA 1
4 NA 1 1 3 3
5 NA NA 2 NA 1
6 NA NA 2 3 3
merge.fixed(x, y, by = "k2", incomparables = NA) # 2 rows
k2 k1.x data.x k1.y data.y
1 4 4 4 4 4
2 5 5 5 5 5
# This appears to match the documented behavior.
# R version
version
_
platform x86_64-apple-darwin10.8.0
arch x86_64
os darwin10.8.0
system x86_64, darwin10.8.0
status
major 3
minor 0.0
year 2013
month 04
day 03
svn rev 62481
language R
version.string R version 3.0.0 (2013-04-03)
nickname Masked Marvel
sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
More information about the R-help
mailing list