[R] (OT) Does pearson correlation assume bivariate normality of the data?

Tue May 26 23:01:40 CEST 2009

This is the sort of problem (another related one is the assumptions of the 
t-test) that attracts a lot of relatively inefficient argument.

Some basic points:

1. If random variables X and Y are uncorrelated (and have finite moments, 
but that's a purely technical issue), the distribution of the Pearson 
correlation coefficient in samples from X and Y will be approximately 
Normal with mean zero in large samples. No further assumption about the 
distribution is needed. So, the test is valid in sufficiently large samples.
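
As a rough check of point 1, here is a minimal simulation sketch. It draws X and Y independently (which is stronger than merely uncorrelated) from an exponential distribution, and counts how often the usual t-statistic for zero correlation exceeds 1.96. The exponential distribution, n = 200, the number of replicates and the seed are all arbitrary choices for illustration, not anything from the original discussion.

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def null_rejection_rate(n=200, reps=2000, seed=1):
    """Fraction of simulated samples in which |t| > 1.96 under the null."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Independent, clearly non-Normal (skewed) samples.
        x = [rng.expovariate(1.0) for _ in range(n)]
        y = [rng.expovariate(1.0) for _ in range(n)]
        r = pearson_r(x, y)
        # Usual t-statistic for testing zero correlation.
        t = r * math.sqrt((n - 2) / (1.0 - r * r))
        if abs(t) > 1.96:  # large-sample two-sided 5% cutoff
            rejections += 1
    return rejections / reps

rate = null_rejection_rate()
print("rejection rate under the null:", rate)
```

If the large-sample argument holds for this setting, the printed rate should be in the neighbourhood of the nominal 5%.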

2.  Similarly, the sample correlation coefficient between two random 
variables X and Y is a consistent estimator of the correlation between X 
and Y.  Here the distribution [needed for confidence intervals] does 
depend on the distributions of X and Y, but by less than you might expect. 
For example, I found that Fisher's z-transformation and a t-distribution 
with n-3 df is a pretty good approximation to the distribution of the 
sample correlation between lognormal random variables (a model for air 
pollution data) with a sample size of 10.
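
For reference, a minimal sketch of a Fisher z confidence interval. Note the substitution: the text above refers to a t-distribution with n-3 df, whereas this sketch uses the more common standard Normal reference distribution with standard error 1/sqrt(n-3), because the Python standard library has no t quantile function. The values r = 0.6 and n = 10 are hypothetical inputs chosen only to match the sample size mentioned.

```python
import math
from statistics import NormalDist

def fisher_z_interval(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z.

    Assumption: standard Normal reference distribution with standard
    error 1/sqrt(n - 3), not the t-distribution mentioned in the text.
    """
    z = math.atanh(r)                      # Fisher's z-transformation
    se = 1.0 / math.sqrt(n - 3)            # approximate standard error of z
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Back-transform the interval endpoints to the correlation scale.
    return math.tanh(z - q * se), math.tanh(z + q * se)

lower, upper = fisher_z_interval(0.6, 10)
print("approximate 95% interval:", round(lower, 3), round(upper, 3))
```

With only 10 observations the interval is wide, which is worth keeping in mind before worrying about the finer distributional details.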

3. If X and Y are bivariate Normal and uncorrelated, they must be 
independent, so the null hypothesis of zero correlation is especially 
interesting for Normal data.

4. Zero correlation may still be an interesting null hypothesis without 
bivariate Normality -- if you don't know much about X and Y it may be an 
advance to be able to establish that Y tends to be higher when X is 
higher.

5. The correlation coefficient is sensitive to outlying observations. This 
is not necessarily a bad thing, but it means that if X and Y both have 
long-tailed distributions the test for zero correlation will be sensitive 
primarily to the tails.

6. If the tails of the distribution are mostly gross-error contamination, 
the sensitivity to the tails is bad.

7. The various robust or rank-based correlations don't estimate the same 
thing, any more than the mean and median estimate the same thing. They 
don't even necessarily have the same sign.  Some of them are 
intended for bivariate Normal data with gross-error contamination, which 
is fine if that is what you have.  Kendall's tau at least has a sensible 
interpretation that doesn't depend on distributions, whereas it's not 
clear to me why the hypothesis of zero Spearman correlation would be 
interesting without distributional assumptions.
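
A small made-up data set illustrates how the three coefficients can disagree. This is only a sample-level illustration of a claim that is really about the quantities being estimated. The data are invented, ties are assumed absent, and Kendall's tau is computed in its simplest form (concordant minus discordant pairs, divided by the number of pairs).

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(v):
    """Ranks 1..n of the values in v (assumes no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    out = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) / number of pairs."""
    score = 0
    pairs = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        score += (s > 0) - (s < 0)
        pairs += 1
    return score / pairs

# Invented data: y increases with x except for one extreme low value.
x = [1, 2, 3, 4, 5]
y = [2, 3, 4, 5, -100]

print("Pearson: ", pearson_r(x, y))     # negative, driven by the last point
print("Spearman:", spearman_rho(x, y))  # zero for these data
print("Kendall: ", kendall_tau(x, y))   # positive: 6 concordant, 4 discordant
```

Here Kendall's tau has the direct reading "a randomly chosen pair of points is more likely to be concordant than discordant", which does not depend on the distribution of either variable.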

8. Permutation tests will give you an exact small-sample test of 
*independence*, not of zero correlation. The test is not exact (it may be 
conservative or anticonservative) if X and Y are dependent but 
uncorrelated. The test has power only against alternatives where the 
correlation is non-zero.
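
A minimal sketch of such a permutation test, using the absolute sample correlation as the test statistic. It samples random permutations rather than enumerating all of them, so the p-value is a Monte Carlo approximation; adding 1 to the numerator and denominator is a common convention that keeps the test valid. The data, the number of permutations and the seed are made up for illustration.

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_pvalue(x, y, n_perm=2000, seed=1):
    """Two-sided Monte Carlo permutation p-value based on |r|.

    Shuffling y breaks any dependence between x and y, so the reference
    distribution is that of |r| under independence, not merely under
    zero correlation.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(pearson_r(x, y_perm)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Invented data with a strong increasing trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 1.9, 3.2, 4.8, 4.1, 6.3, 6.9, 8.4, 8.0, 10.2]

p = permutation_pvalue(x, y)
print("permutation p-value:", p)
```

Because the statistic is the correlation, the test can only detect departures from independence that show up as non-zero correlation, which is the power restriction described in point 8.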

Some of the issues behind the confusion are the same as for the t-test:
  - a confusion of necessary vs sufficient assumptions
  - a confusion of long-tailed distributions and gross error contamination
  - worrying about the meaning of the null hypothesis only for 'parametric'
    tests and not for 'non-parametric' tests
  - not understanding that permutation tests have assumptions.

There is also some genuine and informed disagreement about the relative 
importance of potential problems. Some of this disagreement is about 
philosophical issues, and some is about the likely practical impact, which 
depends a lot on the setting.

 	-thomas

On Tue, 26 May 2009, Liviu Andronic wrote:

> Dear all,
> The other day I was reading this post [1] that slightly surprised me:
> "To reject the null of no correlation, an hypothsis test based on the
> normal distribution. If normality is not the base assumption your
> working from then p-values, significance tests and conf. intervals
> dont mean much (the value of the coefficient is not reliable) " (BOB
> SAMOHYL).
>
> To me this implied that in practice Pearson's product-moment
> correlation (and associated significance) is often used incorrectly.
> Then I went wrestling with the literature, and with my friends on what
> does the Pearson correlation actually impose, and after about a week
> I'm still head-banging against divergent opinions. From what I
> understand there are two aspects to this classical parametric
> procedure:
> 1. Estimating the magnitude of the correlation:
> - the sample data should come from a bivariate normal distribution
> (?cor, ?cor.test, Dalgaard  2003, somewhat implied in many examples
> such as ?rrcov::maryo or Wilcox 2005)
> - the sample data should be (I presume univariate) normal (Crawley
> 2007)
> - the sample data can be of any distribution (if I understand
> correctly the `distribution-free' definition of correlation in Huber
> 1981, 2004)
> - the sample data could come from just about any bivariate
> distribution (Wikipedia [2][3] and associated reference)
> - the coefficient is (very) not robust to univariate outliers (e.g.,
> Huber 1981), and to multivariate outliers (?rrcov::maryo with data
> from Maronna and Yohai 1998)
>
> 2. Assessing whether the correlation is significantly different from
> zero (using a statistic following the t distribution):
> - the data should come from independent normal distributions (?cor.test)
> - at least one of the marginal distributions is normal (Wilcox 2005)
>
> Surprisingly (to me) many sources seem quite evasive on clearly
> defining the Pearson correlation. Reading the literature I was pretty
> much convinced that the correlation coefficient is not robust to
> outliers. The literature is also convincing on the impact of
> contaminated normal, heavy-tailed distributions on parametric tests
> (invalidating their results). However, I'm not clear on the
> distributional assumptions on the data:
> - does the data have to be bivariate normal in order to correctly
> estimate the linear correlation?
> - does the data have to be univariate normal in order to correctly
> estimate the significance of the correlation?
>
> If the above is true, what are the preferable alternatives for
> non-gaussian data (including heavy-tailed normal)? non-parametric
> tests (spearman, kendall)? the robust MASS::cov.mcd, rrcov::CovOgk,
> robust::covRob()? hypothesis testing via Permutation Tests [4]? is
> there a robust cor.test? other robust tests of independence?
>
> Thank you,
> Liviu
>
> [1] http://www.nabble.com/Correlation-on-Tick-Data-tp18589474p18595197.html
> [2] http://en.wikipedia.org/wiki/Correlation#Sensitivity_to_the_data_distribution
> [3] http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Sensitivity_to_the_data_distribution
> [4] http://www.burns-stat.com/pages/Tutor/bootstrap_resampling.html#permtest

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle