[R] cor, cov, method "pairwise.complete.obs"
Peter Dalgaard
p.dalgaard at biostat.ku.dk
Fri Oct 22 13:52:07 CEST 2004
Eric Lecoutre <lecoutre at stat.ucl.ac.be> writes:
> Hi UseRs,
>
> I don't want to die being an idiot...
With age, most of us come to realise that that is the only possible
outcome...
> I don't understand the different results between:
> cor() and cov2cor(cov()).
>
> See this little example:
>
> > x=matrix(c(0.5,0.2,0.3,0.1,0.4,NA,0.7,0.2,0.6,0.1,0.4,0.9),ncol=3)
> > cov2cor(cov(x,use="pairwise.complete.obs"))
>            [,1]       [,2]       [,3]
> [1,]  1.0000000  0.4653400 -0.1159542
> [2,]  0.4653400  1.0000000 -0.7278728
> [3,] -0.1159542 -0.7278728  1.0000000
> > cor(x,use="pairwise.complete.obs")
>            [,1]       [,2]       [,3]
> [1,]  1.0000000  0.3973597 -0.1159542
> [2,]  0.3973597  1.0000000 -0.9736842
> [3,] -0.1159542 -0.9736842  1.0000000
>
>
> My question arises in a context where cor(mydata,
> use="pairwise.complete.obs") returns correlations on diagonal that
> are near 0.95 (whereas my data do have 100 observations and only 12
> missing values...).
[Eh? "correlations on diagonal" are usually 1.00 for me!]
> Do cor() and cov() handle the argument "pairwise.complete.obs" the same way?
Obviously not.
It's not massively hard to experiment your way out of this. Consider
> cov(x,use="pairwise.complete.obs")
             [,1]        [,2]         [,3]
[1,]  0.029166667  0.02000000 -0.006666667
[2,]  0.020000000  0.06333333 -0.061666667
[3,] -0.006666667 -0.06166667  0.113333333
The diagonal elements of this are
> apply(x,2,var,na.rm=T)
[1] 0.02916667 0.06333333 0.11333333
Now, for the correlation calculation, you would arguably need
> var(x[!is.na(x[,2]),1])
[1] 0.04
and we have
> 0.02/sqrt(0.029166667*0.06333333)
[1] 0.4653401
> 0.02/sqrt(0.04*0.06333333)
[1] 0.3973597
which are the two versions of the correlation that you see.
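The same arithmetic can be written without hard-coded numbers. This is only a sketch for the (1,2) pair of the example matrix; the names ok, c12, r_cov2cor and r_cor are made up for illustration and are not part of R.

```r
x <- matrix(c(0.5, 0.2, 0.3, 0.1,
              0.4, NA,  0.7, 0.2,
              0.6, 0.1, 0.4, 0.9), ncol = 3)

# Rows in which columns 1 and 2 are both observed
ok <- complete.cases(x[, 1:2])

# Pairwise-complete covariance of columns 1 and 2 (the numerator in both cases)
c12 <- cov(x[ok, 1], x[ok, 2])

# Scaling as in cov2cor(cov(...)): each variance uses every non-missing
# value of its own column
r_cov2cor <- c12 / sqrt(var(x[, 1], na.rm = TRUE) * var(x[, 2], na.rm = TRUE))

# Scaling as in cor(...): both variances restricted to the jointly observed rows
r_cor <- c12 / sqrt(var(x[ok, 1]) * var(x[ok, 2]))

print(c(cov2cor = r_cov2cor, cor = r_cor))
```

The two results differ only in the denominator: column 1 has variance 0.029166667 over all four rows but 0.04 over the three rows where column 2 is also present.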
I.e., with cov(), pairwise completeness is required only for the
off-diagonal terms; the diagonal terms are the column variances with
that column's own NAs removed (it would be a bit difficult even to
define them otherwise!). With cor(), the variances are computed
separately for each (x,y) pair, using only the rows in which both x
and y are observed.
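To make the cor() behaviour concrete, here is a rough re-implementation for a numeric matrix. pairwise_cor is a hypothetical helper written for this illustration, not an R function, and it assumes every pair of columns has at least two jointly observed rows.

```r
# Hypothetical helper: correlation in which both variances are restricted
# to the rows where columns i and j are jointly observed.
pairwise_cor <- function(x) {
  p <- ncol(x)
  r <- diag(p)                                 # diagonal is 1 by definition
  for (i in seq_len(p - 1)) {
    for (j in (i + 1):p) {
      ok <- !is.na(x[, i]) & !is.na(x[, j])    # jointly observed rows
      r[i, j] <- r[j, i] <-
        cov(x[ok, i], x[ok, j]) / sqrt(var(x[ok, i]) * var(x[ok, j]))
    }
  }
  r
}

x <- matrix(c(0.5, 0.2, 0.3, 0.1,
              0.4, NA,  0.7, 0.2,
              0.6, 0.1, 0.4, 0.9), ncol = 3)

# Compare with the built-in pairwise correlation
agree <- all.equal(pairwise_cor(x), cor(x, use = "pairwise.complete.obs"))
print(agree)
```

On the example matrix this should agree with cor(x, use="pairwise.complete.obs"), whereas cov2cor(cov(x, use="pairwise.complete.obs")) rescales by the per-column variances and so gives different off-diagonal values wherever a column has NAs.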
--
O__ ---- Peter Dalgaard Blegdamsvej 3
c/ /'_ --- Dept. of Biostatistics 2200 Cph. N
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907