[Rd] Possible bug in Spearman correlation with use="pairwise.complete.obs"

Simon Anders anders at embl.de
Fri Jan 21 19:13:45 CET 2011


Hi,

I have just encountered strange behaviour in 'cor' with regard to 
the treatment of NAs when calculating Spearman correlations. I guess it 
is a subtle bug.

If I understand the help page correctly, the two modes 'complete.obs' 
and 'pairwise.complete.obs' differ only in how observations with missing 
values are dropped when calculating a correlation _matrix_. When 
calculating a single (scalar) correlation coefficient for two data 
vectors x and y, both should give the same result.
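
To illustrate the distinction, here is a small sketch (the matrix 'm' and its values are made up for illustration). With three columns, 'complete.obs' drops every row containing an NA before computing any coefficient, whereas 'pairwise.complete.obs' drops rows separately for each pair of columns. With only two vectors there is just one pair, so both modes select the same observations.

```r
# Hypothetical data: only column "c" contains a missing value (row 1).
m <- cbind(a = c(1, 2, 3, 4, 5, 6),
           b = c(2, 1, 4, 3, 6, 5),
           c = c(NA, 3, 1, 5, 2, 4))

# Casewise deletion: row 1 is dropped for every pair, including (a, b).
r_complete <- cor(m, use = "complete.obs")

# Pairwise deletion: (a, b) keeps all six rows; only pairs with "c" lose row 1.
r_pairwise <- cor(m, use = "pairwise.complete.obs")

# The (a, b) entry differs between the two matrices ...
ab_differs <- !isTRUE(all.equal(r_complete["a", "b"], r_pairwise["a", "b"]))

# ... but for a single pair of vectors both modes use the same five rows.
ac_agrees <- isTRUE(all.equal(
  cor(m[, "a"], m[, "c"], use = "complete.obs"),
  cor(m[, "a"], m[, "c"], use = "pairwise.complete.obs")))

print(c(ab_differs = ab_differs, ac_agrees = ac_agrees))
```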

For Pearson correlation, the two modes do in fact agree:

> x <- runif( 10 )
> y <- runif( 10 )
> y[5] <- NA

> cor( x, y, use="complete.obs" )
[1] 0.407858
> cor( x, y, use="pairwise.complete.obs" )
[1] 0.407858

For Spearman correlation, we do NOT get the same results:

> cor( x, y, method="spearman", use="complete.obs" )
[1] 0.3416009
> cor( x, y, method="spearman", use="pairwise.complete.obs" )
[1] 0.3333333

To see the likely reason for this possible bug, observe:

> goodobs <- !is.na(x) & !is.na(y)

> cor( rank(x)[goodobs], rank(y)[goodobs] )
[1] 0.3416009
> cor( rank(x[goodobs]), rank(y[goodobs]) )
[1] 0.3333333

I would claim that only the calculation resulting in 0.3333 is a proper 
Spearman correlation, while the one resulting in 0.3416 is not. After 
all, the following is not a valid set of ranks: there are 9 
observations, but their ranks run from 1 to 10, with 3 skipped:

> rank(x)[goodobs]
[1] 10  6  8  7  4  5  1  9  2
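
The same effect can be reproduced without random numbers. The following is a minimal sketch with made-up values ('x2', 'y2' and 'ok' are illustrative names, not part of the session above). Apart from the final cross-check it uses only 'rank' and Pearson 'cor', so it should not depend on how a particular R version handles the 'use' argument for Spearman.

```r
# Hypothetical data: y2 has one missing value at position 3.
x2 <- c(0.9, 0.2, 0.5, 0.7, 0.1, 0.4)
y2 <- c(0.3, 0.8, NA,  0.6, 0.2, 0.9)
ok <- !is.na(x2) & !is.na(y2)

# Rank first, then drop: the ranks of x2 have a gap (rank 4 is missing).
ranks_then_drop <- rank(x2)[ok]
# Drop first, then rank: a complete set of ranks 1..5.
drop_then_ranks <- rank(x2[ok])

rho_rank_first <- cor(ranks_then_drop, rank(y2[ok]))
rho_drop_first <- cor(drop_then_ranks, rank(y2[ok]))

# Cross-check: removing incomplete pairs by hand before calling cor()
# matches the drop-then-rank value.
rho_manual <- cor(x2[ok], y2[ok], method = "spearman")

print(ranks_then_drop)
print(drop_then_ranks)
print(c(rho_rank_first, rho_drop_first, rho_manual))
```

Subsetting by hand before calling 'cor', as in the 'rho_manual' line, sidesteps the question of which 'use' mode to trust.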

Would you hence agree that 'method="spearman"' with 
'use="complete.obs"', the mode that returned 0.3416 above, is incorrect?

Cheers
   Simon


> sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
  [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
  [5] LC_MONETARY=C             LC_MESSAGES=en_US.utf8
  [7] LC_PAPER=en_US.utf8       LC_NAME=C
  [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] pspearman_0.2-5 SuppDists_1.1-8

loaded via a namespace (and not attached):
[1] tools_2.12.0




+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: sanders at fs.tum.de


