[Rd] Incorrect handling of NA's in cor() (PR#6750)

Fri Apr 9 20:37:23 CEST 2004

Marek Ancukiewicz <msa at biostat.mgh.harvard.edu> writes:

> Dear Thomas,
> 
> The question becomes: how do we rank missing values?  In
> version 1.8.1 at least, cor () uses default handling of
> missing values by rank() [by na.last parameter], that is
> missing values are assigned the highest rank. However, if
> nothing is known about the meaning of NA what would be the
> basis of such an assumption?  Assigning the NAs highest,
> lowest values, or any other values requires some additional
> information.
> 
> It seems that the default handling on missing values should be
> to assign them missing ranks: within cor(), rank() should be
> called with na.last="keep". 

Yes, and that is what 1.9.0beta is doing (it's not like this issue
hasn't been brought up before, just that the fix didn't quite fix it).
I think what we have now is still buggy, but at least it isn't biasing
rho towards +1 whenever x and y tend to be both missing at the same
time.

It's fairly easy to do something more sensible in the complete.cases
case, but getting pairwise.complete.cases right is tricky. 1.9.0
is in deep code freeze, so I don't think we should change things at
this point, except perhaps add a note to the help page.

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907