[Rd] Possible bug in Spearman correlation with use="pairwise.complete.obs"
Simon Anders
anders at embl.de
Fri Jan 21 19:13:45 CET 2011
Hi,
I have just encountered strange behaviour from 'cor' with regard to
the treatment of NAs when calculating Spearman correlations. I suspect
it is a subtle bug.
If I understand the help page correctly, the two modes 'complete.obs'
and 'pairwise.complete.obs' differ only in which observations are
dropped because of NAs when calculating a correlation _matrix_. When
calculating a single (scalar) correlation coefficient for two data
vectors x and y, both should give the same result.
For Pearson correlation, this is in fact the case:
> x <- runif( 10 )
> y <- runif( 10 )
> y[5] <- NA
> cor( x, y, use="complete.obs" )
[1] 0.407858
> cor( x, y, use="pairwise.complete.obs" )
[1] 0.407858
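For contrast, here is a minimal sketch of the matrix case, where the two
modes are expected to differ. The matrix 'm' and its column names are
made up for illustration; only the third column contains an NA.

```r
# Hypothetical 3-column matrix; only column "w" has a missing value (row 2).
m <- cbind(u = c(1, 2, 3, 4, 5),
           v = c(2, 1, 4, 3, 6),
           w = c(5, NA, 3, 2, 1))

# "complete.obs" drops row 2 for every pair of columns, including (u, v).
r_complete <- cor(m, use = "complete.obs")["u", "v"]

# "pairwise.complete.obs" drops row 2 only for pairs involving "w",
# so the (u, v) entry is computed from all five rows.
r_pairwise <- cor(m, use = "pairwise.complete.obs")["u", "v"]

# Compare against the same correlations computed on explicitly chosen rows.
all.equal(r_complete, cor(m[-2, "u"], m[-2, "v"]))
all.equal(r_pairwise, cor(m[, "u"], m[, "v"]))
```

With only two vectors there is just one pair, so both modes drop the same
observations and there is no reason for the results to differ.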
For Spearman correlation, we do NOT get the same results:
> cor( x, y, method="spearman", use="complete.obs" )
[1] 0.3333333
> cor( x, y, method="spearman", use="pairwise.complete.obs" )
[1] 0.3416009
To see the likely reason for this possible bug, observe:
> goodobs <- !is.na(x) & !is.na(y)
> cor( rank(x)[goodobs], rank(y)[goodobs] )
[1] 0.3416009
> cor( rank(x[goodobs]), rank(y[goodobs]) )
[1] 0.3333333
I would claim that only the calculation resulting in 0.3333 is a proper
Spearman correlation, while the one resulting in 0.3416 is not. After
all, the following is not a complete set of ranks: there are only 9
observations, but their ranks run from 1 to 10, skipping 3:
> rank(x)[goodobs]
[1] 10 6 8 7 4 5 1 9 2
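The difference between the two orderings can be reproduced with fixed
data. This is only a sketch: the vectors below are made up (they are not
the runif() draws above), and 'spearman_complete' is a hypothetical
helper, not part of R.

```r
# Hypothetical fixed data; y has one NA at position 3.
x <- c(0.9, 0.2, 0.5, 0.7, 0.1, 0.8, 0.3)
y <- c(0.4, 0.6, NA, 0.9, 0.2, 0.5, 0.1)

goodobs <- !is.na(x) & !is.na(y)

# Rank first, then subset: the rank of the dropped observation is missing,
# so the remaining ranks are not a permutation of 1..n.
rank(x)[goodobs]

# Subset first, then rank: a complete set of ranks 1..n.
rank(x[goodobs])

# Hypothetical helper: Spearman correlation on complete pairs only,
# computed as the Pearson correlation of ranks taken AFTER removing NAs.
spearman_complete <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)
  cor(rank(x[ok]), rank(y[ok]))
}

spearman_complete(x, y)
```

Ranking after the incomplete pairs have been removed is what I would
expect both 'use' modes to do when only two vectors are given.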
Would you hence agree that 'method="spearman"' with
'use="pairwise.complete.obs"' is incorrect?
Cheers
Simon
> sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=C LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] pspearman_0.2-5 SuppDists_1.1-8
loaded via a namespace (and not attached):
[1] tools_2.12.0
+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: sanders at fs.tum.de