[R] bug? in stats::cor for use=complete.obs with NAs
Peter Ehlers
ehlers at ucalgary.ca
Fri Jun 11 01:02:36 CEST 2010
I don't think that this would be considered a bug. The
reason for the discrepancy between use="complete.obs"
and use="pairwise.complete.obs" for the case of the
Spearman correlation of two vectors x, y is this:
"pairwise" does complete.cases(x,y) and then ranks;
this is also what's done in cor.test().
"complete" ranks first (keeping NAs via the
na.last="keep" argument to rank()) and then does
complete.cases(ranked.x,ranked.y) on the ranked data.
This can obviously lead to a different set of
ranks being correlated than those for "pairwise".
I must admit that I wasn't aware that R does this
and I don't know the rationale for it. The help page
says:
If use is "complete.obs" then missing values are
handled by casewise deletion ...
which is not clear on the order of ranking and
deletion, but further down the page:
Note that "spearman" basically computes cor(R(x), R(y))
(or cov(.,.)) where R(u) := rank(u, na.last="keep").
In the case of missing values, the ranks are calculated
depending on the value of use, either based on complete
observations, or based on pairwise completeness with
reranking for each pair.
I guess that this implies that, for "complete", the ranking
occurs before the casewise deletion (else why the
na.last="keep"?).
If anyone knows the rationale and/or can give a reference,
I'd be glad to receive such.
-Peter Ehlers
On 2010-06-09 11:36, hugh.genin at thomsonreuters.com wrote:
> Arrrrr,
>
> I think I've found a bug in the behavior of the stats::cor function when
> NAs are present, but in case I'm missing something, could you look over
> this example and let me know what you think:
>
>
>> a = c(1,3,NA,1,2)
>> b = c(1,2,1,1,4)
>> cor(a,b,method="spearman", use="complete.obs")
> [1] 0.8164966
>> cor(a,b,method="spearman", use="pairwise.complete.obs")
> [1] 0.7777778
>
> My understanding is that, when the inputs are vectors (but not
> necessarily when they're matrices), the "complete.obs" and
> "pairwise.complete.obs" arguments should give identical spearman
> correlations. The above example clearly shows they do not in my version
> of R (2.11.1). However, in cor.test, they do:
>
>
>> cor.test(a,b,method="spearman", use="complete.obs")
>
> Spearman's rank correlation rho
>
> data: a and b
> S = 2.2222, p-value = 0.2222
> alternative hypothesis: true rho is not equal to 0
> sample estimates:
> rho
> 0.7777778
>
>
> So cor and cor.test do not agree, which seems very likely to be a bug.
> When calculating by hand, I also get 0.7777778. Additionally, when
> using an old version of R (2.5.0), both the complete.obs and
> pairwise.complete.obs versions give 0.7777778. Which strongly suggests
> either 2.5.0 or 2.11.1 has a bug in it. Is this a bug? If so, has it
> already been reported? (I found a related but confusing email thread
> from 2004 in the R archives, but I did not find the resolution to that
> bug report).
>
>
> Additional info:
> Platform = Windows XP
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> i386-pc-mingw32
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
> States.1252 LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C LC_TIME=English_United
> States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
>
> loaded via a namespace (and not attached):
> [1] tools_2.11.1
>> Sys.getlocale()
> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>
> Thanks,
>
> --Hugh
>
More information about the R-help
mailing list