[R] Systematic treatment of missing values

Tue May 30 10:34:02 CEST 2006

Thank you very much for your prompt reply and for adding the comments  
to the help pages for match and ==.  I think the source of my  
confusion was that by looking at the current documentation (v 2.3.0)  
I did not realize that matching is different from equality testing.   
(Obviously in the case of using regular expressions, etc, it is  
different, but I thought that when using plain "match" and %in%,  
matching would be determined by ==.)
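
To make the distinction concrete, here is a minimal sketch in base R that puts == next to match and %in% on the same values. The variable names are only illustrative, and the commented results are what the documented matching rules (NA matches only NA, NaN matches only NaN) lead me to expect.

```r
x     <- c(1, 2, 3, NA)
table <- c(2, 3, NA)

# Equality testing: NA is not comparable, so the result is NA.
eq_na  <- NA == NA        # NA
eq_nan <- NaN == NaN      # NA
eq_vec <- x == NA         # NA NA NA NA

# Value matching: NA matches NA, NaN matches NaN, and nothing else.
pos_na  <- match(NA, table)          # 3
pos_nan <- match(NaN, c(NA, NaN))    # 2
in_vec  <- x %in% table              # FALSE TRUE TRUE TRUE
```

So %in% never returns NA: it only reports whether match found a position.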

Also I did not mean for my first comment to sound like a criticism of  
R for treating NAs inconsistently.  Nonetheless I am still curious  
why the particular choice was made that "match" (and therefore %in%)  
acts differently from "==" with respect to NA's and NaN's (with the  
default and the only implemented value of the "incomparables"  
parameter)?
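
For comparison, this is roughly the behaviour I had expected. It is a sketch only: in_by_equality is a hypothetical helper of my own, not part of R, and it assumes that "membership" means "== is TRUE for at least one element of the table".

```r
# Hypothetical helper: membership decided by ==, so NA propagates
# whenever the missing value could have mattered.
in_by_equality <- function(x, table) {
  vapply(x, function(v) any(v == table), logical(1))
}

res_plain <- in_by_equality(c(1, 2, 3, NA), c(2, 3))      # FALSE TRUE TRUE NA
res_na    <- in_by_equality(c(1, 2, 3, NA), c(2, 3, NA))  # NA TRUE TRUE NA
```

Note that under this definition an NA in the table also turns a non-match into NA, since the missing table entry might have been equal to the value sought.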

Thank you,
David

On May 28, 2006, at 1:10 AM, Prof Brian Ripley wrote:

> You start with very general comments, but only use one specific  
> function, match (see ?"%in%", a help page entitled `value matching').
>
> Matching and equality are treated differently.  By definition, NA  
> matches NA and nothing else, and NaN matches NaN and nothing else.   
> In comparisons, these values are not comparable.
>
> As you will have seen from the help page, match() has the expansion  
> capacity for declaring values non-comparable.  That has not been  
> implemented for a decade and no one has supplied code to implement  
> it, so it seems no one has much need of it.
>
> I have added notes to the help pages for match and == to say  
> explicitly what matches and what is comparable.  If the *Draft* R  
> Language Definition were ever to be finished it would have such  
> details: it already has a useful commentary.
>
> On Sat, 27 May 2006, David Soloveichik wrote:
>
>> I am wondering whether there is a well-accepted approach to handling
>> missing values (NA's) in a programming language such as R.  For
>> example, most functions seem to propagate NA to the output when the
>> value of the missing entry could have mattered.  In other words, most
>> functions are not willing to "take a stand" on what the missing value
>> was.  However, some functions don't seem to do this.  For example,
>>
>> > c(1,2,3,NA) %in% c(2,3)
>> [1] FALSE  TRUE  TRUE FALSE
>>
>> rather than: FALSE  TRUE  TRUE NA
>>
>>
>> Also, what is the logic of the following:
>> > c(1,2,3,NA) %in% c(2,3,NA)
>> [1] FALSE  TRUE  TRUE  TRUE
>>
>> Why is the last output value TRUE?  Why does R claim that the NA on
>> the left hand side of %in% is the same as the NA on the right hand
>> side of %in%?
>
> It does not: it reports that it *matches*.  Please do read the help  
> page before posting, as the posting guide asked you to.
>
>> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595