[R] robust method to obtain a correlation coeff?

Mon Aug 24 17:26:56 CEST 2009

On 24-Aug-09 14:47:02, Christian Meesters wrote:
> Hi,
> Being a R-newbie I am wondering how to calculate a correlation
> coefficient (preferably with an associated p-value) for data like:
> 
>> d[,1]
>  [1] 25.5 25.3 25.1   NA 23.3 21.5 23.8 23.2 24.2 22.7 27.6 24.2 ...
>> d[,2]
> [1]  0.0 11.1  0.0   NA  0.0 10.1 10.6  9.5  0.0 57.9  0.0  0.0  ...
> 
> Apparently corr(d) from the boot-library fails with NAs in the data,

Yes, apparently corr() has no option for dealing with NAs.

> also cor.test cannot cope with a different number of NAs.

On the other hand, cor.test() does have an option "na.action"
which, by default, is the same as what is in getOption("na.action").

In my R installation, this, by default, is "na.omit". This has the
effect that, for any pair in (x,y) where at least one of the pair
is NA, that pair will be omitted from the calculation. For example,
basing two vectors x,y on your data above, and a third z which is y
with an extra NA:

  x<-c(25.5,25.3,25.1,NA,23.3,21.5,23.8,23.2,24.2,22.7,27.6,24.2)
  y<-c( 0.0,11.1, 0.0,NA, 0.0,10.1,10.6, 9.5, 0.0,57.9, 0.0, 0.0)
  z<-y; z[8]<-NA

I get
  cor.test(x,y)
  #         Pearson's product-moment correlation
  # data:  x and y 
  # t = -1.3986, df = 9, p-value = 0.1954
  # alternative hypothesis: true correlation is not equal to 0 
  # 95 percent confidence interval:
  #  -0.8156678  0.2375438 
  # sample estimates:
  #       cor 
  # -0.422542 
  # cor.test(x,z)
  #         Pearson's product-moment correlation
  # data:  x and z 
  # t = -1.3466, df = 8, p-value = 0.215
  # alternative hypothesis: true correlation is not equal to 0 
  # 95 percent confidence interval:
  #  -0.8338184  0.2738824 
  # sample estimates:
  #        cor 
  # -0.4298726 

So it has worked in both cases (see the difference in 'df'), despite
the different numbers of NAs in x and z.

For functions such as corr() which do not have provision for omitting
NAs, you can fix it up for yourself before calling the function.
In the case of your two series d[,1], d[,2] you could use an index
variable to select cases:

  ix <- (!is.na(d[,1]))&(!is.na(d[,2]))
  corr(d[ix,])

With my variables x,y,z I get

  ix.1 <- (!is.na(x))&(!is.na(y))
  ix.2 <- (!is.na(x))&(!is.na(z))
  d.1  <-cbind(x,y)
  corr(d.1[ix.1,])
  # [1] -0.422542  ## (and -0.422542 from cor.test above as well)
  d.2  <- cbind(x,z)
  corr(d.2[ix.2,])
  # [1] -0.4298726 ## (and -0.4298726 from cor.test above as well)

Hoping this helps,
Ted.

> Is there a
> solution to this problem (calculating a correlation coefficient and
> ignoring different number of NAs), e.g. Pearson's corr coeff?
> 
> If so, please point me to the relevant piece of documentation.
> 
> TIA
> Christian
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 24-Aug-09                                       Time: 16:26:53
------------------------------ XFMail ------------------------------