[Rd] 'merge' function: behavior w.r.t. NAs in the key column

Simon Anders anders at ebi.ac.uk
Tue Mar 18 18:22:34 CET 2008


Hi Bill,

Bill Dunlap wrote:
> Splus (versions 8.0, 7.0, and 6.2) gives:
>    > merge( x, y, by="key" )
>      key val.x val.y
>    1   2    12    22
>    2   3    13    23
>    3   3    14    23
>    4   3    13    26
>    5   3    14    26
> Is that what you expect?  There is no argument
> to Splus's merge to make it include the NA's
> in the way R's merge does.  Should there be such
> an argument?

Yes, this is what I would expect.

Would it be reasonable to consider Splus's behavior as correct and R's
behavior as inconsistent, and hence to ask for R's 'merge' function to be
fixed?
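
In the meantime, something like the following sketch should give the
Splus-style result in R by dropping the rows with an NA key before merging.
It assumes a single key column, and 'merge_drop_na' is only a name made up
for illustration, not an existing function in R or Splus.

```r
# Example data from the original post
x <- data.frame(key = c(1:3, 3, NA, NA), val = 10 + 1:6)
y <- data.frame(key = c(NA, 2:5, 3, NA), val = 20 + 1:7)

# 'NA == NA' evaluates to NA, not TRUE
stopifnot(is.na(NA == NA))

# Hypothetical helper (not part of base R): remove rows whose key is NA
# from both data frames, then merge as usual. Assumes 'by' names exactly
# one column that is present in both x and y.
merge_drop_na <- function(x, y, by) {
  merge(x[!is.na(x[[by]]), , drop = FALSE],
        y[!is.na(y[[by]]), , drop = FALSE],
        by = by)
}

res <- merge_drop_na(x, y, by = "key")
print(res)
```

With the example data this should leave only the one row for key '2' and the
four rows for key '3', i.e. no rows with NA in the key column.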

Cheers
   Simon

> On Fri, 14 Mar 2008, Simon Anders wrote:
>> I recently ran into a problem with 'merge' that stems from the way
>> missing values in the key column (i.e., the column specified
>> in the "by" argument) are handled. I wonder whether the current behavior
>> is fully consistent.
>> ...
>>> x <- data.frame( key = c(1:3,3,NA,NA), val = 10+1:6 )
>>> y <- data.frame( key = c(NA,2:5,3,NA), val = 20+1:7 )
>> ...
>>> merge( x, y, by="key" )
>>    key val.x val.y
>> 1   2    12    22
>> 2   3    13    23
>> 3   3    13    26
>> 4   3    14    23
>> 5   3    14    26
>> 6  NA    15    21
>> 7  NA    15    27
>> 8  NA    16    21
>> 9  NA    16    27
>>
>> As one should expect, there are now four lines with key value '3',
>> because the key '3' appears twice both in x and in y. According to the
>> logic of merge, a row should be produced in the output for each pairing
>> of a row from x and a row from y where the values of 'key' are equal.
>>
>> However, the 'NA' values are treated exactly the same way. It seems that
>> 'merge' considers the pairing of lines with 'NA' in both 'key' columns
>> an allowed match. IMHO, this runs against the convention that two NAs
>> are not considered equal. ('NA==NA' does not evaluate to 'TRUE'.)
>>
>> It might be more consistent if merge did not include in the output
>> any rows with an "NA" in the key column.
>>
>> Maybe one could add a flag argument to 'merge' to switch between this
>> behaviour and the current one? A note in the help page might be nice, too.
