[R-sig-Geo] Principal Component Analysis - Selecting components? + right choice?

Ashton Shortridge ashton at msu.edu
Thu Dec 11 15:10:38 CET 2008


Hi Corrado,

> I ran the PCA using prcomp, quite successfully. Now I need to use a
> criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?).
>
> What criteria would you suggest?

That's an interesting and probably controversy-generating question. It's 
probably not an R-sig-geo question, either. I am not a PCA person, but the 
rule of thumb I am aware of is to plot the variance each component 
'explains' (a scree plot) and look for a clear breakpoint. I would think just 
about any multivariate analysis text would have a better explanation than I 
can give, though.
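
As a rough sketch of that rule of thumb in R: the matrix X below is simulated 
and only stands in for the climate covariates (it is not the original data), 
the variable names are made up for illustration, and the 90% cutoff is just 
one commonly quoted threshold, not a recommendation.

```r
# Simulated stand-in for the covariate matrix: 6 correlated variables
# driven by 2 latent factors plus noise. Hypothetical data, not the
# climate grids from the original question.
set.seed(1)
n <- 500
f <- matrix(rnorm(n * 2), n, 2)              # two latent factors
w <- matrix(runif(2 * 6, 0.5, 1), 2, 6)      # made-up factor weights
X <- f %*% w + matrix(rnorm(n * 6, sd = 0.3), n, 6)

# Centre and scale so that no variable dominates because of its units
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance 'explained' by each component, and the running total
var_expl <- pca$sdev^2 / sum(pca$sdev^2)
cum_expl <- cumsum(var_expl)
print(round(rbind(var_expl, cum_expl), 3))

# Scree plot: look for the component after which the curve flattens out
screeplot(pca, type = "lines", main = "Scree plot")

# A threshold rule (subjective): smallest k reaching 90% of total variance
k <- which(cum_expl >= 0.90)[1]
print(k)
```

summary(pca) reports the same proportions in its "Proportion of Variance" 
and "Cumulative Proportion" rows.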

As for something more rigorous, I think a lot of people are reluctant to use 
PCA as a modeling approach not so much because it's hard to choose a 
threshold for selecting components, but because the interpretation of the 
meaning of each component is pretty subjective. If you want an explanatory 
model, be careful about using PCA. You would be better served by deciding, 
based perhaps on expert knowledge about the variables, which ones to use in 
the model and which ones not to.

To try to make this a bit more spatial, and therefore more relevant to the 
list, I will also warn you that your various climate variables are almost 
certainly spatially autocorrelated - that is, neighboring and nearby 
observations in the grid are not independent. That has serious implications 
for standard multivariate analysis techniques and diagnostics.
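
One way to see the problem on a grid is a quick Moran's I calculation. The 
message above does not name a diagnostic; Moran's I is simply one common 
choice. The sketch below uses base R only, a simulated 20 x 20 grid, and a 
helper function moran_i that is written here for illustration. Packages such 
as spdep provide tested implementations with significance tests.

```r
# Simulated 20 x 20 grid: a smooth spatial trend plus noise.
# Hypothetical data standing in for one gridded climate variable.
set.seed(42)
nr <- 20; nc <- 20
rc <- expand.grid(row = 1:nr, col = 1:nc)
z_smooth <- rc$row + rc$col + rnorm(nr * nc)   # spatially structured values
z_random <- sample(z_smooth)                   # same values, shuffled in space

# Binary rook-adjacency weights: cells that share an edge are neighbours
d <- as.matrix(dist(rc, method = "manhattan"))
W <- (d == 1) * 1

# Moran's I = (n / sum(W)) * (z' W z) / (z' z), with z centred on its mean
moran_i <- function(x, W) {
  z <- x - mean(x)
  n <- length(x)
  as.numeric((n / sum(W)) * (t(z) %*% W %*% z) / sum(z^2))
}

print(moran_i(z_smooth, W))  # expected to be strongly positive
print(moran_i(z_random, W))  # expected to be near zero
```

A value well above zero means neighbouring cells carry overlapping 
information, so the effective sample size is smaller than the number of 
cells.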

Yours,

Ashton

On Thursday 11 December 2008 06:46:37 am Corrado wrote:
> Dear R gurus,
>
> I have some climatic data for a region of the world. They are monthly
> averages 1950-2000 of precipitation (12 months), minimum temperature (12
> months), maximum temperature (12 months). I have scaled them to 2 km x 2 km
> cells, and I have around 75,000 cells.
>
> I need to feed them into a statistical model as co-variates, to use them to
> predict a response variable.
>
> The climatic data are obviously correlated: precipitation for January is
> correlated to precipitation for February and so on .... even precipitation
> and temperature are heavily correlated. I did some correlation analysis and
> they are all strongly correlated.
>
> I thought of running PCA on them, in order to reduce the number of
> co-variates I feed into the model.
>
> I ran the PCA using prcomp, quite successfully. Now I need to use a
> criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?).
>
> What criteria would you suggest?
>
> At the moment, I am using a criterion based on a threshold, but that is
> highly subjective, even if there are some rules of thumb (Jolliffe,
> Principal Component Analysis, 2nd Edition, Springer Verlag, 2002).
>
> Could you suggest something more rigorous?
>
> By the way, do you think I would have been better off by using something
> different from PCA?
>
> Best,



-- 
Ashton Shortridge
Associate Professor			ashton at msu.edu
Dept of Geography			http://www.msu.edu/~ashton
235 Geography Building		ph (517) 432-3561
Michigan State University		fx (517) 432-1671



