[R-sig-Geo] Principal Component Analysis - Selecting components? + right choice?

Kamran Safi Kamran.Safi at ioz.ac.uk
Thu Dec 11 17:09:37 CET 2008


Hi Corrado,

A useful reference is: Diniz-Filho J.A.F. & Bini L.M. (2005). Modelling
geographical patterns in species richness using eigenvector-based
spatial filters. Global Ecology and Biogeography, 14, 177-185.

So what they basically do is use multidimensional scaling (MDS) to derive
a set of principal coordinates (not to be confused with principal
components) from the pairwise distance matrix. MDS is an (old!) method
that converts a pairwise distance matrix between n points into a matrix
of coordinates on at most n-1 orthogonal axes. It ships with R: have a
look at cmdscale() in package stats (classical, metric MDS) and isoMDS()
in package MASS (non-metric MDS). I would then use all those PCo which
significantly predict your dependent variable(s) (but see below).
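
To make the cmdscale() step concrete, here is a minimal sketch. It assumes
simulated site coordinates and a simulated response (the names xy, resp and
the 0.05 cut-off are placeholders, not anything from the paper or from
Corrado's data), derives the principal coordinates from the pairwise
distance matrix, and screens each one against the response with a simple
linear model.

```r
# Sketch only: simulated sites and response, not the data discussed in this thread.
set.seed(42)
n    <- 60
xy   <- cbind(x = runif(n), y = runif(n))     # hypothetical site coordinates
resp <- 3 * xy[, "x"] + rnorm(n, sd = 0.3)    # hypothetical response with a spatial trend

d   <- dist(xy)                               # pairwise Euclidean distance matrix
pco <- cmdscale(d, k = 2, eig = TRUE)         # classical (metric) MDS, 2 axes

# Screen each principal coordinate against the response, one model per axis
pvals <- apply(pco$points, 2, function(pc) {
  summary(lm(resp ~ pc))$coefficients["pc", "Pr(>|t|)"]
})
keep <- which(pvals < 0.05)                   # assumed significance threshold
```

With plain Euclidean distances between planar coordinates only two axes
carry positive eigenvalues, so k = 2 is all there is to extract here.
Eigenvector filtering methods usually modify the distance matrix before
this step, which the sketch leaves out.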

Another very useful and classic resource is this paper:
http://www.ufz.de/data/Dormann-et-al_Methods-Autocorrelation6834.pdf
Specifically its appendix. Spatial filtering is described there too.

Hope this helps. Good luck.

Kamran




------------------------
Kamran Safi

Postdoctoral Research Fellow
Institute of Zoology
Zoological Society of London
Regent's Park
London NW1 4RY

http://www.zoo.cam.ac.uk/ioz/people/safi.htm

http://spatialr.googlepages.com
http://asapi.wetpaint.com


-----Original Message-----
From: Corrado [mailto:ct529 at york.ac.uk] 
Sent: 11 December 2008 14:44
To: Kamran Safi
Cc: Ashton Shortridge; r-sig-geo at stat.math.ethz.ch
Subject: Re: [R-sig-Geo] Principal Component Analysis - Selecting
components? + right choice?

Hi all,

I used a rule of thumb as reported by the book quoted, but I am not
completely happy with it, because it is not really a statistical
justification.

I will try the broken stick approach, thanks!
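
For reference, the broken stick comparison can be sketched in base R roughly
as follows. This is an illustration under assumptions: the matrix X is
simulated, the helper broken_stick() is a hypothetical name, and the
retention rule (keep leading components while their share of variance
exceeds the expectation) is one common reading of the rule of thumb.
bstick() in vegan, mentioned in this thread, is the packaged route.

```r
# Expected share of variance for each of p components under the
# broken stick model: b_k = (1/p) * sum_{i=k}^{p} 1/i
broken_stick <- function(p) {
  rev(cumsum(1 / (p:1))) / p
}

# Sketch only: simulated correlated variables standing in for climate data
set.seed(1)
z <- matrix(rnorm(200 * 2), ncol = 2)         # two hypothetical latent gradients
X <- cbind(z[, 1] + rnorm(200, sd = 0.3),
           z[, 1] + rnorm(200, sd = 0.3),
           z[, 1] + rnorm(200, sd = 0.3),
           z[, 2] + rnorm(200, sd = 0.3),
           z[, 2] + rnorm(200, sd = 0.3),
           rnorm(200))

pca  <- prcomp(X, center = TRUE, scale. = TRUE)
prop <- pca$sdev^2 / sum(pca$sdev^2)          # observed share of variance per PC
bs   <- broken_stick(length(prop))            # null expectation per PC

# Keep the leading components whose observed share exceeds the expectation
n_keep <- sum(cumprod(prop > bs))
```

Plotting prop against bs (for example with barplot() and lines()) gives the
usual picture of which components sit above the broken stick curve.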

Concerning the interpretation, luckily enough PC1 has a clear
interpretation. PC2 a bit less so, though .... and the complexity of
interpretation increases as the explained variance decreases.

I am using the approach suggested by Kamran: I have boiled down the
original climatic variables to uncorrelated "environmental variables",
and I have chosen the significant ones using a threshold. I do realise
spatial auto-correlation is going to be important (even if my sites are
fairly distant from one another, reducing the impact), but I do not know
anything about spatial filtering. Whilst I have used distance matrices,
I have never used them to remove spatial auto-correlation!

Could you please point me to some resources?

Best,




On Thursday 11 December 2008 14:29:27 Kamran Safi wrote:
> Hi all,
>
> I agree with Ashton. The issue is very complex and far from resolved.
> But sometimes we have to go down the PCA path. Among the many possible
> solutions is the broken stick approach, for which you find an R solution
> (bstick()) in the package vegan. Technically, the broken stick randomly
> divides 100% of the variance among your N principal components and
> generates a null expectation for the distribution of randomly
> partitioning the original variance. You then keep all those PCs whose
> variance lies above the broken stick expectation. This is by no means an
> agreed-upon approach; it is at least reproducible and has some theory
> behind it, but it is and will remain a rule of thumb.
> In terms of spatial analysis you could derive the PCs and then go into
> classic spatial analysis. Although the interpretation of PCs is
> sometimes complicated or even impossible, you can calculate the values
> for every grid cell and then go into multivariate analysis, whereby you
> have to take spatial autocorrelation into account. At least your
> principal components are orthogonal, which simplifies your analysis in
> contrast to using the original variables. It also allows you to produce
> predictive models.
> What you could think of doing is using PCA to derive "environmental"
> variables which are uncorrelated and then use the distance matrix and
> spatial filtering to "remove" spatial autocorrelation.
>
> Hope this helps,
>
> Kami
>
>
> -----Original Message-----
> From: r-sig-geo-bounces at stat.math.ethz.ch
> [mailto:r-sig-geo-bounces at stat.math.ethz.ch] On Behalf Of Ashton
> Shortridge
> Sent: 11 December 2008 14:11
> To: r-sig-geo at stat.math.ethz.ch
> Cc: Corrado
> Subject: Re: [R-sig-Geo] Principal Component Analysis - Selecting
> components? + right choice?
>
> Hi Corrado,
>
> > I ran the PCA using prcomp, quite successfully. Now I need to use a
> > criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?).
> >
> > What criteria would you suggest?
>
> that's an interesting and probably controversy-generating question. It's
> probably not an R-sig-geo question, either. I am not a PCA person, but
> the rule of thumb I am aware of is to plot the variability each
> component 'explains' and look for a clear breakpoint. I would think
> about any multivariate analysis text would have a better explanation
> than I can give, though.
>
> As for something more rigorous, I think a lot of people are reluctant to
> use PCA as a modeling approach not so much because it's hard to choose a
> threshold for selecting components, but because the interpretation of
> the meaning of each component is pretty subjective. If you want an
> explanatory model, be careful about using PCA. You would be better
> served by deciding, based perhaps on expert knowledge about the
> variables, which ones to use in the model and which ones not to.
>
> To try to make this a bit more spatial, and therefore more relevant to
> the list, I will also warn you that your various climate variables are
> almost certainly spatially autocorrelated - that is, neighboring and
> nearby observations in the grid are not independent. That has serious
> implications for standard multivariate analysis techniques and
> diagnostics.
>
> Yours,
>
> Ashton
>
> On Thursday 11 December 2008 06:46:37 am Corrado wrote:
> > Dear R gurus,
> >
> > I have some climatic data for a region of the world. They are monthly
> > averages 1950-2000 of precipitation (12 months), minimum temperature
> > (12 months), maximum temperature (12 months). I have scaled them to
> > 2 km x 2 km cells, and I have around 75,000 cells.
> >
> > I need to feed them into a statistical model as co-variates, to use
> > them to predict a response variable.
> >
> > The climatic data are obviously correlated: precipitation for January
> > is correlated to precipitation for February and so on .... even
> > precipitation and temperature are heavily correlated. I did some
> > correlation analysis and they are all strongly correlated.
> >
> > I thought of running PCA on them, in order to reduce the number of
> > co-variates I feed into the model.
> >
> > I ran the PCA using prcomp, quite successfully. Now I need to use a
> > criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?).
> >
> > What criterion would you suggest?
> >
> > At the moment, I am using a criterion based on a threshold, but that is
> > highly subjective, even if there are some rules of thumb (Jolliffe,
> > Principal Component Analysis, II Edition, Springer Verlag, 2002).
> >
> > Could you suggest something more rigorous?
> >
> > By the way, do you think I would have been better off using
> > something different from PCA?
> >
> > Best,



-- 
Corrado Topi

Global Climate Change & Biodiversity Indicators
Area 18,Department of Biology
University of York, York, YO10 5YW, UK
Phone: + 44 (0) 1904 328645, E-mail: ct529 at york.ac.uk

