[R-sig-Geo] Principal Component Analysis - Selecting components? + right choice?

Corrado ct529 at york.ac.uk
Thu Dec 11 15:44:13 CET 2008


Hi all,

I used a rule of thumb as reported by the book quoted, but I am not completely 
happy with it, because it is not really a statistical justification.

I will try the broken stick approach, thanks!
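
For reference, the broken-stick comparison Kamran describes below can be sketched roughly as follows. This is only a minimal illustration in plain Python, not the vegan implementation: the function names and the example variances are hypothetical, and the stopping rule (keep leading components only while the observed share exceeds the expected share) is one common convention rather than an agreed standard. It uses the closed-form expectation for the k-th largest piece of a stick broken at random into p pieces, b_k = (1/p) * sum_{i=k..p} 1/i.

```python
def broken_stick(p):
    """Expected share of total variance for each of p components
    under the broken-stick null model, largest piece first."""
    # b_k = (1/p) * sum_{i=k}^{p} 1/i, for k = 1..p
    return [sum(1.0 / i for i in range(k, p + 1)) / p for k in range(1, p + 1)]


def components_above_broken_stick(variances):
    """Count leading components whose observed share of variance
    exceeds the broken-stick expectation (stop at the first failure)."""
    total = sum(variances)
    observed = [v / total for v in variances]
    expected = broken_stick(len(variances))
    keep = 0
    for obs, exp in zip(observed, expected):
        if obs <= exp:
            break
        keep += 1
    return keep


# Hypothetical component variances (e.g. squared standard deviations
# from a PCA), sorted in decreasing order; these are not real data.
example_variances = [5.2, 2.1, 0.4, 0.2, 0.1]
n_keep = components_above_broken_stick(example_variances)
```

With the made-up variances above, the first two components exceed their broken-stick expectation and the third does not, so two components would be retained under this rule.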

Concerning the interpretation, luckily enough PC1 has a clear interpretation. 
PC2 a bit less so, though ... and interpretation becomes harder as the 
explained variance decreases.

I am using the approach suggested by Kamran: I have boiled down the original 
climatic variables to uncorrelated "environmental variables", and I have 
chosen the significant ones using a threshold. I do realise spatial 
auto-correlation is going to be important (even if my sites are fairly 
distant from one another, reducing the impact), but I do not know anything 
about spatial filtering. Whilst I have used distance matrices, I have never 
used them to remove spatial auto-correlation!

Could you please point me to some resources?

Best,




On Thursday 11 December 2008 14:29:27 Kamran Safi wrote:
> Hi all,
>
> I agree with Ashton. The issue is very complex and far from resolved.
> But sometimes we have to go down the PCA path. Among the many possible
> solutions is the broken stick approach, for which you find an R solution
> (bstick()) in the package vegan. Technically, the broken stick randomly
> divides 100% of the variance among your N principal components and
> generates a null expectation for the distribution obtained by randomly
> partitioning the original variance. You then keep all those principal
> components that are above the broken stick distribution. This is by no
> means an agreed-upon approach, but it is at least reproducible and has
> some theory behind it; it is and will remain a rule of thumb.
> In terms of spatial analysis you could derive the PCAs and then go into
> classic spatial analysis. Although the interpretation of PCA is
> sometimes complicated or even impossible, you can calculate the values
> for every grid cell and then go into multivariate analysis whereby you
> have to take spatial autocorrelation into account. At least your PCA
> components are orthogonal, which simplifies your analysis in contrast to
> using the original variables. It also allows you to produce predictive
> models.
> One thing you could do is use PCA to derive "environmental" variables
> which are uncorrelated, and then use the distance matrix and spatial
> filtering to "remove" spatial autocorrelation.
>
> Hope this helps,
>
> Kami
>
>
>
> ------------------------
> Kamran Safi
>
> Postdoctoral Research Fellow
> Institute of Zoology
> Zoological Society of London
> Regent's Park
> London NW1 4RY
>
> http://www.zoo.cam.ac.uk/ioz/people/safi.htm
>
> http://spatialr.googlepages.com
> http://asapi.wetpaint.com
>
> -----Original Message-----
> From: r-sig-geo-bounces at stat.math.ethz.ch
> [mailto:r-sig-geo-bounces at stat.math.ethz.ch] On Behalf Of Ashton
> Shortridge
> Sent: 11 December 2008 14:11
> To: r-sig-geo at stat.math.ethz.ch
> Cc: Corrado
> Subject: Re: [R-sig-Geo] Principal Component Analysis -
> Selecting components? + right choice?
>
> Hi Corrado,
>
> > I ran the PCA using prcomp, quite successfully. Now I need to use a
> > criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?)
> >
> > What criteria would you suggest?
>
> That's an interesting and probably controversy-generating question. It's
> probably not an R-sig-geo question, either. I am not a PCA person, but the
> rule of thumb I am aware of is to plot the variability each component
> 'explains' and look for a clear breakpoint. I would think just about any
> multivariate analysis text would have a better explanation than I can
> give, though.
>
> As for something more rigorous, I think a lot of people are reluctant to
> use PCA as a modeling approach not so much because it's hard to choose a
> threshold for selecting components, but because the interpretation of the
> meaning of each component is pretty subjective. If you want an explanatory
> model, be careful about using PCA. You would be better served by deciding,
> based perhaps on expert knowledge about the variables, which ones to use
> in the model and which ones not to.
>
> To try to make this a bit more spatial, and therefore more relevant to the
> list, I will also warn you that your various climate variables are almost
> certainly spatially autocorrelated - that is, neighboring and nearby
> observations in the grid are not independent. That has serious
> implications for standard multivariate analysis techniques and
> diagnostics.
>
> Yours,
>
> Ashton
>
> On Thursday 11 December 2008 06:46:37 am Corrado wrote:
> > Dear R gurus,
> >
> > I have some climatic data for a region of the world. They are monthly
> > averages 1950-2000 of precipitation (12 months), minimum temperature
> > (12 months), maximum temperature (12 months). I have scaled them to
> > 2 km x 2 km cells, and I have around 75,000 cells.
> >
> > I need to feed them into a statistical model as co-variates, to use
> > them to predict a response variable.
> >
> > The climatic data are obviously correlated: precipitation for January
> > is correlated to precipitation for February and so on ... even
> > precipitation and temperature are heavily correlated. I did some
> > correlation analysis and they are all strongly correlated.
> >
> > I thought of running PCA on them, in order to reduce the number of
> > co-variates I feed into the model.
> >
> > I ran the PCA using prcomp, quite successfully. Now I need to use a
> > criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?)
> >
> > What criteria would you suggest?
> >
> > At the moment, I am using a criterion based on a threshold, but that is
> > highly subjective, even if there are some rules of thumb (Jolliffe,
> > Principal Component Analysis, II Edition, Springer Verlag, 2002).
> >
> > Could you suggest something more rigorous?
> >
> > By the way, do you think I would have been better off by using
> > something different from PCA?
> >
> > Best,



-- 
Corrado Topi

Global Climate Change & Biodiversity Indicators
Area 18,Department of Biology
University of York, York, YO10 5YW, UK
Phone: + 44 (0) 1904 328645, E-mail: ct529 at york.ac.uk



