[R] robust method to obtain a correlation coeff?

David Winsemius dwinsemius at comcast.net
Mon Aug 24 17:53:17 CEST 2009


On Aug 24, 2009, at 11:38 AM, David Winsemius wrote:

>
> On Aug 24, 2009, at 11:26 AM, (Ted Harding) wrote:
>
>> On 24-Aug-09 14:47:02, Christian Meesters wrote:
>>> Hi,
>>> Being an R newbie, I am wondering how to calculate a correlation
>>> coefficient (preferably with an associated p-value) for data like:
>>>
>>>> d[,1]
>>> [1] 25.5 25.3 25.1   NA 23.3 21.5 23.8 23.2 24.2 22.7 27.6 24.2 ...
>>>> d[,2]
>>> [1]  0.0 11.1  0.0   NA  0.0 10.1 10.6  9.5  0.0 57.9  0.0  0.0  ...
>>>
>>> Apparently corr(d) from the boot-library fails with NAs in the data,
>>
>> Yes, apparently corr() has no option for dealing with NAs.
>>
>>> also cor.test cannot cope with a different number of NAs.
>>
>> On the other hand, cor.test() does have an option "na.action"
>> which, by default, is the same as what is in getOption("na.action").
>>
>> In my R installation, this, by default, is "na.omit". This has the
>> effect that, for any pair in (x,y) where at least one of the pair
>> is NA, that pair will be omitted from the calculation. For example,
>> basing two vectors x,y on your data above, and a third z which is y
>> with an extra NA:
>>
>> x<-c(25.5,25.3,25.1,NA,23.3,21.5,23.8,23.2,24.2,22.7,27.6,24.2)
>> y<-c( 0.0,11.1, 0.0,NA, 0.0,10.1,10.6, 9.5, 0.0,57.9, 0.0, 0.0)
>> z<-y; z[8]<-NA
>>
>> I get
>> cor.test(x,y)
>> <snipped unneeded output>
>> # sample estimates:
>> #        cor
>> # -0.422542
>>
>> cor.test(x,z)
>> <snipped unneeded output>
>> # sample estimates:
>> #        cor
>> # -0.4298726
>>
>> So it has worked in both cases (see the difference in 'df'), despite
>> the different numbers of NAs in x and z.
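The difference in 'df' that Ted mentions can be checked directly. A minimal sketch reusing his x, y and z (the names ct.xy and ct.xz are mine, not from the thread): cor.test() keeps only the complete pairs, so the degrees of freedom, which are the number of complete pairs minus 2, should differ between the two calls.

```r
x <- c(25.5, 25.3, 25.1, NA, 23.3, 21.5, 23.8, 23.2, 24.2, 22.7, 27.6, 24.2)
y <- c( 0.0, 11.1,  0.0, NA,  0.0, 10.1, 10.6,  9.5,  0.0, 57.9,  0.0,  0.0)
z <- y; z[8] <- NA            # z is y with one extra NA

ct.xy <- cor.test(x, y)       # incomplete pairs are dropped before testing
ct.xz <- cor.test(x, z)

sum(complete.cases(x, y))     # 11 complete pairs
sum(complete.cases(x, z))     # 10 complete pairs
ct.xy$parameter               # df = 11 - 2 = 9
ct.xz$parameter               # df = 10 - 2 = 8
ct.xy$p.value                 # the p-value the original poster asked for
```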
>
> You may not need to go through the material that follows. There is
> already a set of functions to handle such concerns:
>
> ?na.omit will bring up a help page describing:
>
> na.fail(object, ...)
> na.omit(object, ...)
> na.exclude(object, ...)
> na.pass(object, ...)
>

Apologies; this was a bit hastily constructed. What I quote below
comes from the ?options help page, specifically its "Options set in
package stats" section.

> na.action: the name of a function for treating missing values (NA's)  
> for certain situations.
>
> ... but I do not know what those "certain situations" really are.
So there are some functions that may be affected by the setting of
options("na.action"), but I cannot tell you where to find a list of
such functions.
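As one illustration only, and not a list of the affected functions: the formula interface of cor.test() takes an na.action argument, whereas cor() has its own 'use' argument instead. A rough sketch, assuming Ted's x and y are put into a data frame that I call d here (that name, ct, r and res are mine):

```r
getOption("na.action")        # the session-wide setting, e.g. "na.omit"

d <- data.frame(
  x = c(25.5, 25.3, 25.1, NA, 23.3, 21.5, 23.8, 23.2, 24.2, 22.7, 27.6, 24.2),
  y = c( 0.0, 11.1,  0.0, NA,  0.0, 10.1, 10.6,  9.5,  0.0, 57.9,  0.0,  0.0)
)

# Formula method: rows containing an NA are handled by na.action
ct <- cor.test(~ x + y, data = d, na.action = na.omit)

# cor() does not consult na.action; missing values are controlled by 'use'
r <- cor(d$x, d$y, use = "complete.obs")

all.equal(unname(ct$estimate), r)   # both use the same complete pairs

# With na.fail the same call stops instead of silently dropping rows
res <- try(cor.test(~ x + y, data = d, na.action = na.fail), silent = TRUE)
inherits(res, "try-error")
```

Passing na.action explicitly, as above, avoids depending on whatever the session option happens to be.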


>>
>> For functions such as corr() which do not have provision for omitting
>> NAs, you can fix it up for yourself before calling the function.
>> In the case of your two series d[,1], d[,2] you could use an index
>> variable to select cases:
>>
>> ix <- (!is.na(d[,1]))&(!is.na(d[,2]))
>> corr(d[ix,])
>>
>> With my variables x,y,z I get
>>
>> ix.1 <- (!is.na(x))&(!is.na(y))
>> ix.2 <- (!is.na(x))&(!is.na(z))
>> d.1  <-cbind(x,y)
>> corr(d.1[ix.1,])
>> # [1] -0.422542  ## (and -0.422542 from cor.test above as well)
>> d.2  <- cbind(x,z)
>> corr(d.2[ix.2,])
>> # [1] -0.4298726 ## (and -0.4298726 from cor.test above as well)
>>
>> Hoping this helps,
>> Ted.
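The index-variable approach can also be written with complete.cases(), which flags the rows that contain no NA. A sketch under the same data as Ted's x and y; the names ok and r.cc are mine, and the boot::corr() call is guarded because I am assuming the boot package may not be installed everywhere.

```r
x <- c(25.5, 25.3, 25.1, NA, 23.3, 21.5, 23.8, 23.2, 24.2, 22.7, 27.6, 24.2)
y <- c( 0.0, 11.1,  0.0, NA,  0.0, 10.1, 10.6,  9.5,  0.0, 57.9,  0.0,  0.0)

d.1  <- cbind(x, y)
ix.1 <- (!is.na(x)) & (!is.na(y))   # Ted's hand-built index
ok   <- complete.cases(d.1)         # TRUE for rows without any NA

identical(ok, ix.1)                 # the two indices select the same rows

# Correlation on the complete rows only, using base R
r.cc <- cor(d.1[ok, "x"], d.1[ok, "y"])

# The same rows can be handed to boot::corr(), which has no NA handling
if (requireNamespace("boot", quietly = TRUE)) {
  r.boot <- boot::corr(d.1[ok, ])
  print(all.equal(r.boot, r.cc))
}
```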

>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



