[R] robust method to obtain a correlation coeff?
David Winsemius
dwinsemius at comcast.net
Mon Aug 24 17:38:21 CEST 2009
On Aug 24, 2009, at 11:26 AM, (Ted Harding) wrote:
> On 24-Aug-09 14:47:02, Christian Meesters wrote:
>> Hi,
>> Being a R-newbie I am wondering how to calculate a correlation
>> coefficient (preferably with an associated p-value) for data like:
>>
>>> d[,1]
>> [1] 25.5 25.3 25.1 NA 23.3 21.5 23.8 23.2 24.2 22.7 27.6 24.2 ...
>>> d[,2]
>> [1] 0.0 11.1 0.0 NA 0.0 10.1 10.6 9.5 0.0 57.9 0.0 0.0 ...
>>
>> Apparently corr(d) from the boot-library fails with NAs in the data,
>
> Yes, apparently corr() has no option for dealing with NAs.
>
>> also cor.test cannot cope with a different number of NAs.
>
> On the other hand, cor.test() does have an option "na.action"
> which, by default, is the same as what is in getOption("na.action").
>
> In my R installation, this, by default, is "na.omit". This has the
> effect that, for any pair in (x,y) where at least one of the pair
> is NA, that pair will be omitted from the calculation. For example,
> basing two vectors x,y on your data above, and a third z which is y
> with an extra NA:
>
> x<-c(25.5,25.3,25.1,NA,23.3,21.5,23.8,23.2,24.2,22.7,27.6,24.2)
> y<-c( 0.0,11.1, 0.0,NA, 0.0,10.1,10.6, 9.5, 0.0,57.9, 0.0, 0.0)
> z<-y; z[8]<-NA
>
> I get
> cor.test(x,y)
> <snipped unneeded output>
> # sample estimates:
> #       cor
> # -0.422542
> cor.test(x,z)
> <snipped unneeded output>
> # sample estimates:
> #       cor
> # -0.4298726
>
> So it has worked in both cases (see the difference in 'df'), despite
> the different numbers of NAs in x and z.
You may not need to go through the material that follows. There is
already a set of functions to handle such concerns. ?na.omit brings up
a help page describing:

  na.fail(object, ...)
  na.omit(object, ...)
  na.exclude(object, ...)
  na.pass(object, ...)

It reminded me of the description of the "na.action" option:

  na.action: the name of a function for treating missing values (NA's)
  for certain situations.

... but I do not know what those "certain situations" really are.
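
As far as I can tell, na.action is consulted mainly by functions that
build a model frame from a formula, which includes the formula method
of cor.test(). The default cor.test(x, y) method simply drops incomplete
pairs. A minimal sketch, reusing Ted's x and z from above; the names
dat, fit_omit, fit_fail and refused are mine, for illustration only:

```r
x <- c(25.5, 25.3, 25.1, NA, 23.3, 21.5, 23.8, 23.2, 24.2, 22.7, 27.6, 24.2)
y <- c( 0.0, 11.1,  0.0, NA,  0.0, 10.1, 10.6,  9.5,  0.0, 57.9,  0.0,  0.0)
z <- y; z[8] <- NA

dat <- data.frame(x = x, z = z)

# Formula interface: na.action decides what happens to rows containing NA.
# na.omit drops them before the correlation is computed.
fit_omit <- cor.test(~ x + z, data = dat, na.action = na.omit)
fit_omit$estimate

# na.fail refuses to proceed when any NA is present, so the call errors.
fit_fail <- tryCatch(
  cor.test(~ x + z, data = dat, na.action = na.fail),
  error = function(e) e
)
refused <- inherits(fit_fail, "error")
refused
```

So na.omit silently drops rows, while na.fail is the choice when missing
values should be treated as a mistake in the data.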
>
> For functions such as corr() which do not have provision for omitting
> NAs, you can fix it up for yourself before calling the function.
> In the case of your two series d[,1], d[,2] you could use an index
> variable to select cases:
>
> ix <- (!is.na(d[,1]))&(!is.na(d[,2]))
> corr(d[ix,])
>
> With my variables x,y,z I get
>
> ix.1 <- (!is.na(x))&(!is.na(y))
> ix.2 <- (!is.na(x))&(!is.na(z))
> d.1 <-cbind(x,y)
> corr(d.1[ix.1,])
> # [1] -0.422542 ## (and -0.422542 from cor.test above as well)
> d.2 <- cbind(x,z)
> corr(d.2[ix.2,])
> # [1] -0.4298726 ## (and -0.4298726 from cor.test above as well)
>
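
The index-variable idea above can also be written with complete.cases(),
and base cor() has a "use" argument that does the same filtering
internally. A hedged sketch with Ted's vectors; d3, keep, r_manual, pw
and co are illustrative names, and cor() stands in for boot's corr()
here so that nothing beyond base R is needed:

```r
x <- c(25.5, 25.3, 25.1, NA, 23.3, 21.5, 23.8, 23.2, 24.2, 22.7, 27.6, 24.2)
y <- c( 0.0, 11.1,  0.0, NA,  0.0, 10.1, 10.6,  9.5,  0.0, 57.9,  0.0,  0.0)
z <- y; z[8] <- NA

d3 <- cbind(x, y, z)

# TRUE for rows with no NA in the selected columns
keep <- complete.cases(d3[, c("x", "z")])
r_manual <- cor(d3[keep, "x"], d3[keep, "z"])

# Pairwise deletion: each pair of columns uses every row complete for that pair,
# so x-y uses 11 rows while x-z uses 10.
pw <- cor(d3, use = "pairwise.complete.obs")

# Casewise deletion: only rows complete across ALL columns are used (10 rows).
co <- cor(d3, use = "complete.obs")

pw["x", "y"]
co["x", "y"]
```

With columns that have different numbers of NAs, the pairwise and
casewise results can differ, as the x-y entry shows here.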
> Hoping this helps,
> Ted.
>
>> Is there a
>> solution to this problem (calculating a correlation coefficient and
>> ignoring different number of NAs), e.g. Pearson's corr coeff?
>>
>> If so, please point me to the relevant piece of documentation.
>>
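
For the coefficient together with a p-value, the pieces can be pulled
out of the object that cor.test() returns. A small sketch using Ted's
vectors rather than your d; the object names are mine. The Spearman
call is only a suggestion for a rank-based alternative, with
exact = FALSE assumed because the data contain ties:

```r
x <- c(25.5, 25.3, 25.1, NA, 23.3, 21.5, 23.8, 23.2, 24.2, 22.7, 27.6, 24.2)
y <- c( 0.0, 11.1,  0.0, NA,  0.0, 10.1, 10.6,  9.5,  0.0, 57.9,  0.0,  0.0)
z <- y; z[8] <- NA

# Pearson correlation; incomplete pairs are dropped automatically
pearson_xy <- cor.test(x, y)
pearson_xz <- cor.test(x, z)

pearson_xy$estimate    # correlation coefficient
pearson_xy$parameter   # degrees of freedom = number of complete pairs - 2
pearson_xy$p.value     # p-value for H0: true correlation is 0

# Rank-based alternative; asymptotic p-value because of tied values
spearman_xz <- cor.test(x, z, method = "spearman", exact = FALSE)
spearman_xz$estimate
spearman_xz$p.value
```

The difference in degrees of freedom between pearson_xy and pearson_xz
reflects the extra NA in z.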
David Winsemius, MD
Heritage Laboratories
West Hartford, CT