[R-sig-eco] Should one remove highly correlated variables before doing PCA??

Baldwin, Jim -FS jbaldwin at fs.fed.us
Wed Mar 6 15:10:54 CET 2013


Two additional issues might be considered:

1.  Correlated variables are still correlated after a PCA or after tossing one of them, so neither approach teases apart the separate effects of the two variables (nor can that necessarily be done with the particular dataset at hand).

2.  The purpose of using PCA should be clear and chosen to meet your objectives.  Just because you can do a PCA doesn't mean you should.  For example, if PCA is performed to obtain "uncorrelated" variables for a regression, then consider that the component explaining the most variation will not necessarily be a wonderful predictor.  The component explaining the least amount of variation might be the best predictor.  Performing PCA for a regression has always puzzled me: why would one think that doing something in complete isolation from the dependent variable would make for better predictors?  (Orthogonal and more numerically stable estimators of the coefficients, yes, but not necessarily coefficients of interest.)
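
A small simulation can illustrate the second point. The sketch below is only an illustration under stated assumptions: the data are simulated, the response is assumed to depend on the difference between two highly correlated predictors, and all names (x1, x2, y) are made up. It relies on the fact that for two standardized variables with correlation r the principal components are (z1 + z2)/sqrt(2) and (z1 - z2)/sqrt(2), with eigenvalues 1 + r and 1 - r.

```python
import math
import random

random.seed(1)
n = 500

# Simulated data (an assumption for illustration): two predictors that share
# a common signal and are therefore highly correlated.
shared = [random.gauss(0, 1) for _ in range(n)]
x1 = [s + 0.3 * random.gauss(0, 1) for s in shared]
x2 = [s + 0.3 * random.gauss(0, 1) for s in shared]
# Assumed response: driven by the DIFFERENCE of the predictors, plus noise.
y = [a - b + 0.1 * random.gauss(0, 1) for a, b in zip(x1, x2)]


def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]


def pearson(u, v):
    """Pearson correlation computed from standardized values."""
    zu, zv = standardize(u), standardize(v)
    return sum(a * b for a, b in zip(zu, zv)) / (len(u) - 1)


z1, z2 = standardize(x1), standardize(x2)
r = pearson(x1, x2)

# Closed-form PCA of a 2 x 2 correlation matrix (valid here because r > 0):
# PC1 is the scaled sum, PC2 the scaled difference.
pc1 = [(a + b) / math.sqrt(2) for a, b in zip(z1, z2)]
pc2 = [(a - b) / math.sqrt(2) for a, b in zip(z1, z2)]
share_pc1 = (1 + r) / 2  # proportion of total variance on PC1
share_pc2 = (1 - r) / 2  # proportion of total variance on PC2

r_y_pc1 = pearson(y, pc1)
r_y_pc2 = pearson(y, pc2)

print(f"PC1: variance share {share_pc1:.2f}, correlation with y {r_y_pc1:.2f}")
print(f"PC2: variance share {share_pc2:.2f}, correlation with y {r_y_pc2:.2f}")
```

In this constructed case the component carrying almost all of the variance is nearly useless for predicting y, while the small component carries almost all of the signal. Real data need not behave this way; the point is only that variance explained among the predictors says nothing by itself about the response.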

Jim

-----Original Message-----
From: r-sig-ecology-bounces at r-project.org [mailto:r-sig-ecology-bounces at r-project.org] On Behalf Of Chris Howden
Sent: Tuesday, March 05, 2013 10:45 PM
To: 张勇; r-sig-ecology at r-project.org
Subject: Re: [R-sig-eco] Should one remove highly correlated variables before doing PCA??

Hi Yong,

PCA is a way to deal with highly correlated variables, so there is no need to remove them.

If N variables are highly correlated then they will all load on the SAME principal component (eigenvector), not different ones. This is how you identify them as being highly correlated. If you were to do further analysis you can then either:

1) Use the principal component, and interpret it according to which variables load on it
2) Choose one of the highly correlated variables (identified as those that all load onto the same component) and analyse only it.

Most people using PCA would choose option 1).
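
To make the "load on the same component" idea concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the data are simulated, the variable names (temp_a, temp_b, temp_c, rain) are hypothetical, and the leading eigenvector of the correlation matrix is found by simple power iteration rather than by a statistics package.

```python
import math
import random

random.seed(2)
n = 400

# Simulated data (hypothetical names): three variables that share a common
# signal, plus one unrelated variable.
base = [random.gauss(0, 1) for _ in range(n)]
data = {
    "temp_a": [b + 0.3 * random.gauss(0, 1) for b in base],
    "temp_b": [b + 0.3 * random.gauss(0, 1) for b in base],
    "temp_c": [b + 0.3 * random.gauss(0, 1) for b in base],
    "rain": [random.gauss(0, 1) for _ in range(n)],
}
names = list(data)


def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


corr = [[pearson(data[a], data[b]) for b in names] for a in names]


def leading_eigenvector(matrix, iterations=500):
    """Power iteration: largest eigenvalue and its unit-length eigenvector."""
    k = len(matrix)
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue for the converged vector.
    value = sum(
        vec[i] * sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)
    )
    return value, vec


eigenvalue, loadings = leading_eigenvector(corr)
variance_share = eigenvalue / len(names)

print(f"PC1 explains {variance_share:.0%} of the total variance")
for name, loading in zip(names, loadings):
    print(f"{name:>7}: loading {loading:+.2f}")
```

In this simulation the three correlated variables get large loadings of the same sign on the first component, while the unrelated variable gets a loading close to zero, which is the pattern used for option 1) above.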

A bit more detail.

Many methods have a hard time dealing with multicollinearity, which is when a number of variables are highly correlated (I suggest you Google it). Before analysis this is usually dealt with in one of 2 ways:
1) Use PCA to get a set of orthogonal, i.e. uncorrelated, variables and analyse them
2) Use correlation coefficients to determine which variables are highly correlated and use only 1 of them in the analysis. A common cut-off for "highly correlated" is an absolute correlation of 0.8.
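
Way 2) can be sketched roughly as follows. This is one possible reading, not a standard procedure: the greedy rule (keep a variable only if its absolute correlation with every variable already kept is below the cut-off) is an assumption, the data are simulated, and the variable names are hypothetical. Note that the result depends on the order in which the variables are visited.

```python
import math
import random

random.seed(3)
n = 300

# Simulated data (hypothetical names): temperature and snow_days both track
# elevation closely, soil_ph is unrelated.
elevation = [random.gauss(0, 1) for _ in range(n)]
data = {
    "elevation": elevation,
    "temperature": [-e + 0.3 * random.gauss(0, 1) for e in elevation],
    "soil_ph": [random.gauss(0, 1) for _ in range(n)],
    "snow_days": [e + 0.4 * random.gauss(0, 1) for e in elevation],
}


def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def filter_by_correlation(columns, cutoff=0.8):
    """Greedy filter: keep a variable unless it is highly correlated
    (|r| >= cutoff) with a variable that has already been kept."""
    kept = []
    dropped = {}  # dropped variable -> the kept variable it duplicates
    for name in columns:
        partner = next(
            (k for k in kept
             if abs(pearson(columns[name], columns[k])) >= cutoff),
            None,
        )
        if partner is None:
            kept.append(name)
        else:
            dropped[name] = partner
    return kept, dropped


kept, dropped = filter_by_correlation(data, cutoff=0.8)
print("kept:", kept)
for name, partner in dropped.items():
    print(f"dropped {name} (highly correlated with {partner})")
```

Recording which kept variable each dropped one duplicates makes it easier to choose the representative on ecological grounds rather than by column order.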

Variance Inflation Factors (VIFs) are also used. Personally I don't like them since they don't tell me which variables each one is correlated with. They are also clumsy to use. You can't simply remove all variables with a high VIF or you will likely remove some useful variables, e.g. if 4 variables all have a high VIF you don't know if it's because all 4 are correlated or if there are 2 sets of highly correlated variables. So which do you remove?  If you must use them it's IMPERATIVE that you remove only 1 at a time and then rerun to get new VIFs: remove 1, get new VIFs, remove 1, and so on. This prevents you from removing too many variables.
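
The one-at-a-time procedure might look roughly like the sketch below. It is an illustration under assumptions: the data are simulated as two separate pairs of highly correlated variables (the "2 sets" case), the names are hypothetical, and the threshold of 10 is only a commonly quoted rule of thumb. The VIFs are taken from the diagonal of the inverse of the correlation matrix, which equals 1 / (1 - R^2) for each variable regressed on the others.

```python
import math
import random

random.seed(4)
n = 300

# Simulated data (hypothetical names): two SEPARATE pairs of highly
# correlated variables.
depth = [random.gauss(0, 1) for _ in range(n)]
flow = [random.gauss(0, 1) for _ in range(n)]
data = {
    "depth_1": [d + 0.15 * random.gauss(0, 1) for d in depth],
    "depth_2": [d + 0.15 * random.gauss(0, 1) for d in depth],
    "flow_1": [f + 0.15 * random.gauss(0, 1) for f in flow],
    "flow_2": [f + 0.15 * random.gauss(0, 1) for f in flow],
}


def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vifs(columns):
    """VIF per variable: diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


def drop_one_at_a_time(columns, threshold=10.0):
    """Remove the variable with the largest VIF, recompute, and repeat
    until every remaining VIF is at or below the threshold."""
    remaining = dict(columns)
    removed = []
    while len(remaining) > 1:
        current = vifs(remaining)
        worst = max(current, key=current.get)
        if current[worst] <= threshold:
            break
        removed.append(worst)
        del remaining[worst]
    return list(remaining), removed


flagged_at_once = [name for name, v in vifs(data).items() if v > 10.0]
kept, removed = drop_one_at_a_time(data, threshold=10.0)
print("flagged if all high VIFs were removed at once:", flagged_at_once)
print("kept after one-at-a-time removal:", kept)
```

In this simulated case all four variables start with a high VIF, so removing them all at once would leave nothing, whereas recomputing after each removal keeps one variable from each pair.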


Chris Howden B.Sc. (Hons) GStat.





-----Original Message-----
From: r-sig-ecology-bounces at r-project.org
[mailto:r-sig-ecology-bounces at r-project.org] On Behalf Of 张勇
Sent: Wednesday, 6 March 2013 4:33 PM
To: r-sig-ecology at r-project.org
Subject: [R-sig-eco] Should one remove highly correlated variables before doing PCA??

Hi list,

Maybe this is not an "R" question; however, it has bothered me for a long time.

Some people think that a set of correlated variables might "load" onto several principal components (eigenvectors), so including many variables from such a set will differentially weight several eigenvectors and thereby change the directions of all eigenvectors, too.  So, according to these considerations, we should discard some highly correlated variables before doing PCA.

On the other hand, some people think that correlated variables are ok, because PCA outputs vectors that are orthogonal.  So we do not need to remove highly correlated variables before doing PCA.

For myself, I choose the first method (removing highly correlated variables). But, based on practical ecological knowledge, I retain as many of the ecologically meaningful variables as I can.

What's your suggestion for this issue? Any hint will be greatly appreciated!
Thanks a lot in advance.

Best regards,

Yong

_______________________________________________
R-sig-ecology mailing list
R-sig-ecology at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology







