[Rd] 'merge' function: behavior w.r.t. NAs in the key column

Fri Mar 14 18:16:38 CET 2008

Hi,

I recently ran into a problem with 'merge' that stems from the way
missing values in the key column (i.e., the column specified
in the "by" argument) are handled. I wonder whether the current behavior
is fully consistent.

Please have a look at this example:

> x <- data.frame( key = c(1:3,3,NA,NA), val = 10+1:6 )
> y <- data.frame( key = c(NA,2:5,3,NA), val = 20+1:7 )

> x
   key val
1   1  11
2   2  12
3   3  13
4   3  14
5  NA  15
6  NA  16

> y
   key val
1  NA  21
2   2  22
3   3  23
4   4  24
5   5  25
6   3  26
7  NA  27

> merge( x, y, by="key" )
   key val.x val.y
1   2    12    22
2   3    13    23
3   3    13    26
4   3    14    23
5   3    14    26
6  NA    15    21
7  NA    15    27
8  NA    16    21
9  NA    16    27

As one should expect, there are now four lines with key value '3',
because the key '3' appears twice both in x and in y. According to the
logic of merge, a row should be produced in the output for each pairing
of a row from x and a row from y where the values of 'key' are equal.
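
To make the pairing rule concrete, here is a small sketch that counts the
expected number of output rows per non-NA key as the product of that key's
frequencies in x and y. It only restates the rule described above and is not
meant to show how 'merge' is implemented; the names 'nx', 'ny', 'common' and
'expected_rows' are made up for this illustration.

```r
x <- data.frame(key = c(1:3, 3, NA, NA), val = 10 + 1:6)
y <- data.frame(key = c(NA, 2:5, 3, NA), val = 20 + 1:7)

# Frequency of each non-NA key (table() drops NA by default).
nx <- table(x$key)
ny <- table(y$key)

# Keys present in both data frames.
common <- intersect(names(nx), names(ny))

# One output row per (x row, y row) pairing with equal keys.
expected_rows <- as.vector(nx[common]) * as.vector(ny[common])
names(expected_rows) <- common
print(expected_rows)
```

For the data above this gives one row for key '2' and four rows for key '3',
in line with the non-NA part of the merge output.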

However, the 'NA' values are treated exactly the same way. It seems that 
'merge' considers the pairing of lines with 'NA' in both 'key' columns 
an allowed match. IMHO, this runs against the convention that two NAs 
are not considered equal. ('NA==NA' does not evaluate to 'TRUE'.)
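
The two conventions can be seen side by side in a short sketch. The
comparison operator yields NA for two NAs, whereas 'match' does treat NA as
matching NA. Whether 'merge' relies on 'match' internally is an assumption
on my part; the sketch only shows that both conventions exist in base R.

```r
# Comparison with '==' propagates the missing value.
eq <- NA == NA
print(eq)            # NA, not TRUE
print(isTRUE(eq))    # FALSE

# 'match' finds an NA in the table and returns its position.
m <- match(NA, NA)
print(m)

# The same holds for a numeric key vector containing NA.
pos <- match(NA_real_, c(2, NA))
print(pos)
```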

It might be more consistent if merge did not include any rows with an
"NA" in the key column in the output.

Maybe one could add a flag argument to 'merge' to switch between this
behavior and the current one? A note in the help page might be nice, too.
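
Until then, the proposed behavior can be approximated from user code. The
following is a minimal sketch, assuming a single key column: it drops the
rows whose key is NA from both data frames before calling 'merge'. The helper
name 'merge_drop_na_keys' is hypothetical and not part of R.

```r
x <- data.frame(key = c(1:3, 3, NA, NA), val = 10 + 1:6)
y <- data.frame(key = c(NA, 2:5, 3, NA), val = 20 + 1:7)

# Hypothetical helper (not part of base R): remove rows with an NA key
# from both inputs, then merge as usual on that single key column.
merge_drop_na_keys <- function(x, y, by) {
  stopifnot(length(by) == 1)
  x_ok <- x[!is.na(x[[by]]), , drop = FALSE]
  y_ok <- y[!is.na(y[[by]]), , drop = FALSE]
  merge(x_ok, y_ok, by = by)
}

res <- merge_drop_na_keys(x, y, by = "key")
print(res)
```

With the example data, the result keeps the one row for key '2' and the four
rows for key '3', and contains no rows with NA in the key column.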

Best regards
   Simon

+---
| Dr. Simon Anders, Dipl. Phys.
| European Bioinformatics Institute, Hinxton, Cambridgeshire, UK
| office phone +44-1223-494478, mobile phone +44-7505-841692
| preferred (permanent) e-mail: sanders at fs.tum.de