[R] cor and missing values. Bug?
Jane Fridlyand
janef at stat.Berkeley.EDU
Wed May 26 01:17:57 CEST 2004
There seems to be an issue in computing rank correlations when missing
values are present. I think this comes from the way the rank() function
works, but I am not sure how to go about fixing it. By default, rank()
places missing values at the end, thus skewing the rank relationship
between two vectors:
Example:
R : Copyright 2003, The R Foundation for Statistical Computing
Version 1.8.1 (2003-11-21), ISBN 3-900051-00-3
> vec1 <- 1:10
> vec2 <- 2*vec1
> vec1[c(1, 5)] <- NA
> cor(vec1, vec2, use="pair", method="pearson")
[1] 1
> cor(vec1[-c(1,5)], vec2[-c(1,5)], use="pair", method="pearson")
[1] 1
#pearson is OK
> cor(vec1, vec2, use="pair", method="spearman")
[1] 0.3212121
> cor(vec1[-c(1,5)], vec2[-c(1,5)], use="pair", method="spearman")
[1] 1
> cor(vec1, vec2, use="complete", method="spearman")
[1] 0.3212121
#BUG?
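For what it is worth, the 0.3212121 above can apparently be reproduced by hand. The following is only a sketch, assuming that the ranks are computed on each full vector with the NAs placed last (na.last = TRUE is passed explicitly, since the default may differ between R versions). The names r1, r2 and rho_naive are mine, for illustration only.

```r
vec1 <- 1:10
vec2 <- 2 * vec1
vec1[c(1, 5)] <- NA

# Rank each vector separately, putting NAs last (assumed behaviour).
r1 <- rank(vec1, na.last = TRUE)
r2 <- rank(vec2, na.last = TRUE)

# The two NAs receive the highest ranks, 9 and 10.
print(r1)

# Pearson correlation of the ranks, i.e. Spearman with NAs ranked last.
rho_naive <- cor(r1, r2, method = "pearson")
print(rho_naive)  # approximately 0.3212121
```

If this is what happens inside cor, the missing values are never dropped; they are simply treated as the two largest observations of vec1.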
Interestingly, the "complete" option, which should exclude missing values
entirely, does not fix the issue either. I think that the rank function
must be applied before "use" takes effect (looking at the actual code of
cor, this is indeed the case).
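A possible workaround, sketched here under the assumption that the intended result is the rank correlation over the complete pairs only, is to drop the incomplete pairs first and rank afterwards. The names ok and rho are hypothetical; this is not taken from the code of cor.

```r
vec1 <- 1:10
vec2 <- 2 * vec1
vec1[c(1, 5)] <- NA

# Keep only the positions where both values are observed.
ok <- complete.cases(vec1, vec2)

# Rank after removing the incomplete pairs, then correlate the ranks.
rho <- cor(rank(vec1[ok]), rank(vec2[ok]), method = "pearson")
print(rho)
```

This agrees with the result obtained above by removing elements 1 and 5 by hand, but it does not require knowing in advance where the missing values are.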
I looked through the archives but have not seen this reported. Is it a bug
in the rank correlations or am I misinterpreting the intention?
Thank you
Jane
********************************************************************************
Jane Fridlyand, Assistant Professor
Department of Epidemiology and Biostatistics
Center for Bioinformatics and Molecular Biostatistics
UCSF Comprehensive Cancer Center,
Box 0128 San Francisco, CA 94143-0128
Office: Room N224 Tel: (415)476-0168 Fax: (415)502-3179