[Rd] Incorrect handling of NA's in cor() (PR#6750)
Peter Dalgaard
p.dalgaard at biostat.ku.dk
Fri Apr 9 20:37:23 CEST 2004
Marek Ancukiewicz <msa at biostat.mgh.harvard.edu> writes:
> Dear Thomas,
>
> The question becomes: how do we rank missing values? In
> version 1.8.1 at least, cor() relies on the default handling of
> missing values in rank() [the na.last parameter], that is,
> missing values are assigned the highest ranks. However, if
> nothing is known about the meaning of NA what would be the
> basis of such an assumption? Assigning the NAs highest,
> lowest values, or any other values requires some additional
> information.
>
> It seems that the default handling of missing values should be
> to assign them missing ranks: within cor(), rank() should be
> called with na.last="keep".
Yes, and that is what 1.9.0beta is doing (it's not like this issue
hasn't been brought up before, just that the fix didn't quite fix it).
I think what we have now is still buggy, but at least it isn't biasing
rho towards +1 whenever x and y tend to be both missing at the same
time.
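
To make the bias concrete, here is a minimal sketch with made-up data
(the vectors and variable names are illustrative assumptions, not taken
from the bug report). x and y are perfectly negatively related where
observed and are missing at the same positions:

```r
# Illustrative data (invented for this sketch): x and y move in opposite
# directions where observed and are missing together.
x <- c(1, 2, 3, 4, NA, NA, NA)
y <- c(4, 3, 2, 1, NA, NA, NA)

# Default rank(): na.last = TRUE, so the NAs receive the highest ranks.
rx_last <- rank(x)
ry_last <- rank(y)

# na.last = "keep": the NAs stay NA in the rank vector.
rx_keep <- rank(x, na.last = "keep")
ry_keep <- rank(y, na.last = "keep")

# Correlation of the ranks under each treatment of NA.
rho_last <- cor(rx_last, ry_last)
rho_keep <- cor(rx_keep, ry_keep, use = "complete.obs")

print(c(na.last.TRUE = rho_last, na.last.keep = rho_keep))
```

With the NAs ranked highest, the shared missing positions act like
concordant pairs and pull the coefficient positive, although the
observed pairs are perfectly discordant.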
It's fairly easy to do something more sensible in the "complete.obs"
case, but getting "pairwise.complete.obs" right is tricky. 1.9.0
is in deep code freeze, so I don't think we should change things at
this point, except perhaps add a note to the help page.
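
As a rough sketch of what "more sensible" could mean (this is one
reading, not what 1.9.0 does): drop the incomplete cases first and rank
only what is left, so the ranks run over 1..n of the retained cases.
The helper names spearman_complete() and spearman_pairwise() are
hypothetical, and the data are made up.

```r
# Hypothetical helper: Spearman's rho with ranks computed AFTER removing
# cases where either variable is missing.
spearman_complete <- function(x, y) {
  ok <- complete.cases(x, y)
  cor(rank(x[ok]), rank(y[ok]))
}

# Hypothetical helper: the pairwise variant has to redo the ranking for
# every pair of columns, since each pair keeps a different set of rows.
# The diagonal is simply set to 1 in this sketch.
spearman_pairwise <- function(m) {
  p <- ncol(m)
  out <- diag(p)
  for (i in seq_len(p - 1)) {
    for (j in (i + 1):p) {
      out[i, j] <- out[j, i] <- spearman_complete(m[, i], m[, j])
    }
  }
  out
}

# Illustrative data: y is missing in two cases where x is observed.
x <- c(1, 2, 3, 4, 5)
y <- c(2, NA, 1, 5, NA)

# Ranking first with na.last = "keep" and dropping cases afterwards
# leaves x ranks 1, 3, 4 instead of 1, 2, 3.
rho_rank_first <- cor(rank(x, na.last = "keep"),
                      rank(y, na.last = "keep"),
                      use = "complete.obs")
rho_drop_first <- spearman_complete(x, y)

print(c(rank.first = rho_rank_first, drop.first = rho_drop_first))
```

The two orders of operation give different answers on the same data,
and in the pairwise case no single ranking of a column serves all the
pairs it takes part in.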
--
O__ ---- Peter Dalgaard Blegdamsvej 3
c/ /'_ --- Dept. of Biostatistics 2200 Cph. N
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907