[R] cor, cov, method "pairwise.complete.obs"

Peter Dalgaard p.dalgaard at biostat.ku.dk
Fri Oct 22 13:52:07 CEST 2004


Eric Lecoutre <lecoutre at stat.ucl.ac.be> writes:

> Hi UseRs,
> 
> I don't want to die being an idiot...

With age, most of us come to realise that that is the only possible
outcome...
 
> I don't understand the different results between:
> cor() and cov2cor(cov()).
> 
> See this little example:
> 
>  > x=matrix(c(0.5,0.2,0.3,0.1,0.4,NA,0.7,0.2,0.6,0.1,0.4,0.9),ncol=3)
>  > cov2cor(cov(x,use="pairwise.complete.obs"))
>             [,1]       [,2]       [,3]
> [1,]  1.0000000  0.4653400 -0.1159542
> [2,]  0.4653400  1.0000000 -0.7278728
> [3,] -0.1159542 -0.7278728  1.0000000
>  > cor(x,use="pairwise.complete.obs")
>             [,1]       [,2]       [,3]
> [1,]  1.0000000  0.3973597 -0.1159542
> [2,]  0.3973597  1.0000000 -0.9736842
> [3,] -0.1159542 -0.9736842  1.0000000
> 
> 
> My question arises in a context where cor(mydata,
> use="pairwise.complete.obs") returns correlations on the diagonal that
> are near 0.95 (whereas my data have 100 observations and only 12
> missing values...).

[Eh? "correlations on diagonal" are usually 1.00 for me!]
 
> Do cor() and cov() handle the same way the argument "pairwise.complete.obs"?

Obviously not.

It's not massively hard to experiment your way out of this. Consider

> cov(x,use="pairwise.complete.obs")
             [,1]        [,2]         [,3]
[1,]  0.029166667  0.02000000 -0.006666667
[2,]  0.020000000  0.06333333 -0.061666667
[3,] -0.006666667 -0.06166667  0.113333333

The diagonal elements of this are

> apply(x,2,var,na.rm=T)
[1] 0.02916667 0.06333333 0.11333333

Now, for the correlation calculation, you would arguably need the
variance of column 1 taken over only those rows where column 2 is
observed:

> var(x[!is.na(x[,2]),1])
[1] 0.04

and we have

> 0.02/sqrt(0.029166667*0.06333333)
[1] 0.4653401

> 0.02/sqrt(0.04*0.06333333)
[1] 0.3973597

which are the two versions of the correlation that you see. 

I.e., with cov(), pairwise completeness is required only for the
off-diagonal terms; the diagonal terms are the per-column variances
with each column's own NAs removed (it would be a bit difficult even to
define them otherwise!). With cor(), the variances are recomputed for
each (x,y) pair, using only the rows where both x and y are observed.
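
The two conventions can be put side by side in a short script. This is
only a sketch built on the example matrix from the question; pair_cor()
is a hypothetical helper written for illustration, not a function that
ships with R.

```r
x <- matrix(c(0.5, 0.2, 0.3, 0.1,
              0.4, NA,  0.7, 0.2,
              0.6, 0.1, 0.4, 0.9), ncol = 3)

# Hypothetical helper (not part of R): correlation of columns i and j,
# with the covariance AND both variances taken over only those rows
# where both columns are observed.
pair_cor <- function(m, i, j) {
  ok <- complete.cases(m[, c(i, j)])
  cov(m[ok, i], m[ok, j]) / sqrt(var(m[ok, i]) * var(m[ok, j]))
}

# Convention used by cor(): variances restricted to the pair's complete rows
r_pairwise <- pair_cor(x, 1, 2)
r_cor      <- cor(x, use = "pairwise.complete.obs")[1, 2]

# Convention implied by cov2cor(cov()): pairwise covariance divided by
# per-column variances, each computed with only that column's NAs removed
v          <- apply(x, 2, var, na.rm = TRUE)
r_rescaled <- cov(x, use = "pairwise.complete.obs")[1, 2] / sqrt(v[1] * v[2])
r_cov2cor  <- cov2cor(cov(x, use = "pairwise.complete.obs"))[1, 2]

print(c(r_pairwise, r_cor))       # these two should agree
print(c(r_rescaled, r_cov2cor))   # and so should these two
```

The first pair should reproduce the 0.3973597 from cor() in the
question, and the second pair the 0.4653400 from cov2cor(cov()).
Entries for columns 1 and 3 match under both conventions because
neither column has missing values.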

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907



