[R] robust method to obtain a correlation coeff?

Mon Aug 24 17:38:21 CEST 2009

On Aug 24, 2009, at 11:26 AM, (Ted Harding) wrote:

> On 24-Aug-09 14:47:02, Christian Meesters wrote:
>> Hi,
>> Being a R-newbie I am wondering how to calculate a correlation
>> coefficient (preferably with an associated p-value) for data like:
>>
>>> d[,1]
>> [1] 25.5 25.3 25.1   NA 23.3 21.5 23.8 23.2 24.2 22.7 27.6 24.2 ...
>>> d[,2]
>> [1]  0.0 11.1  0.0   NA  0.0 10.1 10.6  9.5  0.0 57.9  0.0  0.0  ...
>>
>> Apparently corr(d) from the boot-library fails with NAs in the data,
>
> Yes, apparently corr() has no option for dealing with NAs.
>
>> also cor.test cannot cope with a different number of NAs.
>
> On the other hand, cor.test() does have an option "na.action"
> which, by default, is the same as what is in getOption("na.action").
>
> In my R installation, this, by default, is "na.omit". This has the
> effect that, for any pair in (x,y) where at least one of the pair
> is NA, that pair will be omitted from the calculation. For example,
> basing two vectors x,y on your data above, and a third z which is y
> with an extra NA:
>
>  x<-c(25.5,25.3,25.1,NA,23.3,21.5,23.8,23.2,24.2,22.7,27.6,24.2)
>  y<-c( 0.0,11.1, 0.0,NA, 0.0,10.1,10.6, 9.5, 0.0,57.9, 0.0, 0.0)
>  z<-y; z[8]<-NA
>
> I get
>  cor.test(x,y)
> <snipped unneeded output>
>  # sample estimates:
>  #        cor
>  # -0.422542
>
>  cor.test(x,z)
> <snipped unneeded output>
>  # sample estimates:
>  #        cor
>  # -0.4298726
>
> So it has worked in both cases (see the difference in 'df'), despite
> the different numbers of NAs in x and z.
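
Since a p-value was also asked for: the pieces of a cor.test() result can
be pulled out by name. A minimal sketch using Ted's x, y and z from above
(ct_xy and ct_xz are just illustrative names, not anything from the
original posts):

```r
# Ted's example vectors, copied from the quoted message
x <- c(25.5, 25.3, 25.1, NA, 23.3, 21.5, 23.8, 23.2, 24.2, 22.7, 27.6, 24.2)
y <- c( 0.0, 11.1,  0.0, NA,  0.0, 10.1, 10.6,  9.5,  0.0, 57.9,  0.0,  0.0)
z <- y; z[8] <- NA   # z has one more missing value than y

# Pairs with an NA in either member are left out of each test
ct_xy <- cor.test(x, y)
ct_xz <- cor.test(x, z)

ct_xy$estimate    # Pearson correlation for the complete (x, y) pairs
ct_xy$parameter   # degrees of freedom = number of complete pairs - 2
ct_xy$p.value     # p-value for H0: true correlation is 0

ct_xz$estimate
ct_xz$parameter   # one fewer df, because z[8] is NA
ct_xz$p.value
```

The df should drop from 9 to 8 between the two calls, which is the
difference in 'df' that Ted refers to.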

You may not need to go through the material that follows. There is
already a set of functions to handle such concerns:

?na.omit will bring up a help page describing:

na.fail(object, ...)
na.omit(object, ...)
na.exclude(object, ...)
na.pass(object, ...)

It reminded me that the help page for options() says:

na.action: the name of a function for treating missing values (NA's)  
for certain situations.

... but I do not know what those "certain situations" really are.
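
As far as I can tell, one such situation is the formula interface of
cor.test(), which accepts an na.action argument, whereas cor() uses its
own 'use' argument instead. A rough sketch, assuming the data sit in a
data frame (d here is rebuilt from the twelve values quoted above, not
the poster's full data, and the object names are only illustrative):

```r
# Data frame built from the values quoted in the thread
d <- data.frame(
  x = c(25.5, 25.3, 25.1, NA, 23.3, 21.5, 23.8, 23.2, 24.2, 22.7, 27.6, 24.2),
  y = c( 0.0, 11.1,  0.0, NA,  0.0, 10.1, 10.6,  9.5,  0.0, 57.9,  0.0,  0.0)
)

# na.omit() drops every row that contains at least one NA
d_complete <- na.omit(d)
nrow(d_complete)

# Correlation on the cleaned rows ...
r_omit <- cor(d_complete$x, d_complete$y)

# ... and the same thing via cor()'s own missing-value switch
r_use <- cor(d$x, d$y, use = "complete.obs")

# Formula interface of cor.test(), with na.action given explicitly
ct <- cor.test(~ x + y, data = d, na.action = na.omit)
ct$estimate
ct$p.value
```

All three routes should agree, since each ends up using only the rows
where both x and y are present.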
>
> For functions such as corr() which do not have provision for omitting
> NAs, you can fix it up for yourself before calling the function.
> In the case of your two series d[,1], d[,2] you could use an index
> variable to select cases:
>
>  ix <- (!is.na(d[,1]))&(!is.na(d[,2]))
>  corr(d[ix,])
>
> With my variables x,y,z I get
>
>  ix.1 <- (!is.na(x))&(!is.na(y))
>  ix.2 <- (!is.na(x))&(!is.na(z))
>  d.1  <-cbind(x,y)
>  corr(d.1[ix.1,])
>  # [1] -0.422542  ## (and -0.422542 from cor.test above as well)
>  d.2  <- cbind(x,z)
>  corr(d.2[ix.2,])
>  # [1] -0.4298726 ## (and -0.4298726 from cor.test above as well)
>
> Hoping this helps,
> Ted.
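
The index variable above can also be built with complete.cases(), which
flags the rows that have no NA in any column. A small sketch along the
lines of Ted's d.2 example (the name 'ok' is made up here, and the call
to corr() is only attempted if the boot package is installed):

```r
# Ted's x and z, copied from the quoted message
x <- c(25.5, 25.3, 25.1, NA, 23.3, 21.5, 23.8, 23.2, 24.2, 22.7, 27.6, 24.2)
z <- c( 0.0, 11.1,  0.0, NA,  0.0, 10.1, 10.6,   NA,  0.0, 57.9,  0.0,  0.0)

d.2 <- cbind(x, z)

# TRUE for rows where both columns are present; same idea as
# (!is.na(d.2[,1])) & (!is.na(d.2[,2]))
ok <- complete.cases(d.2)
sum(ok)   # number of usable pairs

# Base R correlation on the complete rows
r <- cor(d.2[ok, 1], d.2[ok, 2])
r

# corr() from the boot package has no NA handling, so pass it
# only the complete rows
if (requireNamespace("boot", quietly = TRUE)) {
  r_boot <- boot::corr(d.2[ok, ])
  print(r_boot)
}
```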
>
>> Is there a
>> solution to this problem (calculating a correlation coefficient and
>> ignoring different number of NAs), e.g. Pearson's corr coeff?
>>
>> If so, please point me to the relevant piece of documentation.
>>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT