[R-sig-eco] Should one remove highly correlated variables before doing PCA??
Chris Howden
chris at trickysolutions.com.au
Wed Mar 6 07:45:23 CET 2013
Hi Yong,
PCA is a way to deal with highly correlated variables, so there is no need
to remove them.
If N variables are highly correlated then they will all load on the SAME
principal component (eigenvector), not on different ones. This is how you
identify them as being highly correlated. If you want to do further analysis
you can then either:
1) Use the principal component itself, and interpret it according to which
variables load on it
2) Choose one of the highly correlated variables (identified as those that
all load on the same component) and analyse only that variable.
Most people using PCA would choose option 1).
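To illustrate the point about loadings, here is a minimal Python sketch. The
thread itself contains no code, so everything in it is an assumption made for
illustration: the variable names x1, x2 and x3 are hypothetical, the data are
simulated, and the first principal component is approximated by power
iteration on the correlation matrix rather than by a full PCA routine.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def first_component(corr, iterations=500):
    """Approximate the leading eigenvector of a symmetric matrix
    by power iteration (gives the first principal component only)."""
    v = [1.0] * len(corr)
    for _ in range(iterations):
        w = [sum(r * x for r, x in zip(row, v)) for row in corr]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Simulated data (hypothetical): x1 and x2 share a common signal,
# x3 is generated independently of both.
random.seed(0)
n = 1000
shared = [random.gauss(0, 1) for _ in range(n)]
data = {
    "x1": [s + random.gauss(0, 0.1) for s in shared],
    "x2": [s + random.gauss(0, 0.1) for s in shared],
    "x3": [random.gauss(0, 1) for _ in range(n)],
}

names = list(data)
corr = [[pearson(data[a], data[b]) for b in names] for a in names]
loadings = dict(zip(names, first_component(corr)))

for name, value in loadings.items():
    print(f"{name}: {value:+.2f}")
```

With data built this way, x1 and x2 should show loadings of similar size on
the first component while x3 stays close to zero, which is the pattern
described above.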
A bit more detail.
Many methods have a hard time dealing with multicollinearity, which is when
a number of the variables are highly correlated with each other (I suggest
you Google it). Before analysis this is usually dealt with in one of 2 ways:
1) Use PCA to get a set of orthogonal, i.e. uncorrelated, variables and
analyse them
2) Use correlation coefficients to determine which variables are highly
correlated and use only one of them in the analysis. A common cut-off for
"highly correlated" is an absolute correlation of 0.8.
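Option 2) can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the variable names (temp, elev, rain)
and their values are made up, and the helper correlated_pairs is a
hypothetical name, not a function from any package. It lists every pair of
variables whose absolute correlation reaches the 0.8 cut-off mentioned above,
so that you can decide which member of each pair to keep.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(data, cutoff=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= cutoff."""
    names = list(data)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(data[a], data[b])
            if abs(r) >= cutoff:
                pairs.append((a, b, r))
    return pairs


# Made-up example values: elev falls almost linearly as temp rises,
# rain is only loosely related to either.
data = {
    "temp": [1, 2, 3, 4, 5, 6, 7, 8],
    "elev": [8.1, 6.9, 6.2, 4.8, 4.1, 3.0, 1.9, 1.2],
    "rain": [3, 1, 4, 1, 5, 9, 2, 6],
}

flagged = correlated_pairs(data)
for a, b, r in flagged:
    print(f"{a} and {b}: r = {r:+.2f}")
```

Only the temp/elev pair is flagged here. Note that the check uses the
absolute value, so a strong negative correlation counts as well.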
Variance Inflation Factors (VIFs) are also used. Personally I don't like them
since they don't tell me which other variables a given variable is correlated
with. They are also clumsy to use. You can't simply remove all variables with
a high VIF or you will likely remove some useful variables, e.g. if 4
variables all have a high VIF you don't know if it's because all 4 are
correlated with each other or because there are 2 sets of highly correlated
variables. So which do you remove? If you must use them it's IMPERATIVE that
you only remove 1 at a time and then rerun to get new VIFs, remove 1, get new
VIFs, remove 1, etc. This prevents you from removing too many variables.
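The remove-one-then-recompute loop can be sketched as follows. Again this is
an illustration, not anyone's reference implementation: the data are
simulated, the names a1, a2, b1, b2 are hypothetical, and the threshold of 10
is an assumed rule of thumb that does not come from this thread. The VIFs are
taken from the diagonal of the inverse correlation matrix, which equals
1 / (1 - R^2) for each variable regressed on the others.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(data):
    """VIF of each variable: diagonal of the inverse correlation matrix."""
    names = list(data)
    corr = [[pearson(data[a], data[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}


def drop_one_at_a_time(data, threshold=10.0):
    """Drop the single worst variable, recompute all VIFs, and repeat."""
    kept = dict(data)
    dropped = []
    while len(kept) > 1:
        current = vifs(kept)
        worst = max(current, key=current.get)
        if current[worst] <= threshold:  # assumed cut-off, not from the thread
            break
        dropped.append(worst)
        del kept[worst]
    return list(kept), dropped


# Simulated data (hypothetical): two separate sets of correlated variables.
random.seed(42)
n = 300
base_a = [random.gauss(0, 1) for _ in range(n)]
base_b = [random.gauss(0, 1) for _ in range(n)]
data = {
    "a1": [v + random.gauss(0, 0.1) for v in base_a],
    "a2": [v + random.gauss(0, 0.1) for v in base_a],
    "b1": [v + random.gauss(0, 0.1) for v in base_b],
    "b2": [v + random.gauss(0, 0.1) for v in base_b],
}

initial = vifs(data)
kept, dropped = drop_one_at_a_time(data)
print("initial VIFs:", {k: round(v, 1) for k, v in initial.items()})
print("kept:", kept, "dropped:", dropped)
```

All 4 variables start with a high VIF, yet the loop keeps one variable from
each set. Removing everything above the threshold in a single pass would have
discarded all 4.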
Chris Howden B.Sc. (Hons) GStat.
Founding Partner
Evidence Based Strategic Development, IP Commercialisation and Innovation,
Data Analysis, Modelling and Training
-----Original Message-----
From: r-sig-ecology-bounces at r-project.org
[mailto:r-sig-ecology-bounces at r-project.org] On Behalf Of ??
Sent: Wednesday, 6 March 2013 4:33 PM
To: r-sig-ecology at r-project.org
Subject: [R-sig-eco] Should one remove highly correlated variables before
doing PCA??
Hi list,
Maybe this is not an "R" question; however, it has bothered me for a long
time.
Some people think that a set of correlated variables might "load" onto
several principal components (eigenvectors), so including many variables from
such a set will differentially weight several eigenvectors, and thereby
change the directions of all eigenvectors, too. So, according to these
considerations, we should discard some highly correlated variables before
doing PCA.
On the other hand, some people think that correlated variables are OK,
because PCA outputs vectors that are orthogonal. So we do not need to
remove highly correlated variables before doing PCA.
For myself, I choose the first method (removing highly correlated
variables). But, based on practical ecological knowledge, I retain as many
of the ecologically meaningful variables as I can.
What's your suggestion for this issue? Any hint will be greatly appreciated!
Thanks a lot in advance.
Best regards,
Yong