[R-sig-Geo] Principal Component Analysis - Selecting components? + right choice?

G. Allegri giohappy at gmail.com
Fri Dec 12 01:13:15 CET 2008


I'd just like to share a link to a paper I recently read on the subject
of this thread: http://dx.doi.org/10.1016/j.csda.2004.06.015

giovanni

2008/12/11 Kamran Safi <Kamran.Safi at ioz.ac.uk>:
> Hi Corrado,
>
> A useful reference is: Diniz-Filho J.A.F. & Bini L.M. (2005). Modelling
> geographical patterns in species richness using eigenvector-based
> spatial filters. Global Ecology and Biogeography, 14, 177-185.
>
> So what they basically do is use multidimensional scaling (MDS) to
> derive a set of principal coordinates (not to be confused with principal
> components) from the pairwise distance matrix. MDS is an (old!) method
> that converts an n x n pairwise distance matrix into a matrix of
> coordinates with at most n-1 dimensions. It ships with R: have a look at
> cmdscale() in the stats package (classical MDS) and isoMDS() in MASS
> (non-metric MDS). There I would use all those PCo which significantly
> predict your dependent variable(s) (but see below).
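A minimal sketch of what classical MDS does with a distance matrix, written in plain Python rather than R so that it has no dependencies. It is not the code behind cmdscale(); it only illustrates the usual recipe (double-centre the squared distances, then take the leading eigenvector), and it assumes the distances are Euclidean so that the leading eigenvalue is positive. The three-site distance matrix is made up, and only the first principal coordinate is extracted.

```python
import math

def double_center(dist):
    """Gower double-centring: B = -1/2 * J D2 J, with D2 the squared distances."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + grand)
             for j in range(n)] for i in range(n)]

def leading_eigenpair(mat, iters=200):
    """Power iteration; assumes the dominant eigenvalue is positive."""
    n = len(mat)
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-zero start vector
    val = 0.0
    for _ in range(iters):
        new = [sum(mat[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            break
        vec = [x / norm for x in new]
        val = norm
    return val, vec

# Hypothetical example: three sites on a line at positions 0, 1 and 3.
dist = [[0.0, 1.0, 3.0],
        [1.0, 0.0, 2.0],
        [3.0, 2.0, 0.0]]

B = double_center(dist)
eigenvalue, eigenvector = leading_eigenpair(B)

# First principal coordinate of each site: sqrt(eigenvalue) * eigenvector.
coords = [math.sqrt(eigenvalue) * v for v in eigenvector]
print(coords)
```

Because the example sites lie on a line, this single coordinate reproduces all pairwise distances (up to an overall sign and shift). With real sites more axes are needed, which is where the choice of which PCo to keep comes in.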
>
> Another very useful and classic resource is this paper:
> http://www.ufz.de/data/Dormann-et-al_Methods-Autocorrelation6834.pdf
> Specifically its appendix. Spatial filtering is described there too.
>
> Hope this helps. Good luck.
>
> Kamran
>
>
>
>
> ------------------------
> Kamran Safi
>
> Postdoctoral Research Fellow
> Institute of Zoology
> Zoological Society of London
> Regent's Park
> London NW1 4RY
>
> http://www.zoo.cam.ac.uk/ioz/people/safi.htm
>
> http://spatialr.googlepages.com
> http://asapi.wetpaint.com
>
>
> -----Original Message-----
> From: Corrado [mailto:ct529 at york.ac.uk]
> Sent: 11 December 2008 14:44
> To: Kamran Safi
> Cc: Ashton Shortridge; r-sig-geo at stat.math.ethz.ch
> Subject: Re: [R-sig-Geo] Principal Component Analysis -
> Selecting components? + right choice?
>
> Hi all,
>
> I used a rule of thumb as reported in the book quoted, but I am not
> completely happy with it, because it is not really a statistical
> justification.
>
> I will try the broken stick approach, thanks!
>
> Concerning the interpretation, luckily enough PC1 has a clear
> interpretation. PC2 a bit less so, though ... and the complexity of
> interpretation increases as the explained variance decreases.
>
> I am using the approach suggested by Kamran: I have boiled down the
> original climatic variables to uncorrelated "environmental variables",
> and I have chosen the significant ones using a threshold. I do realise
> spatial auto-correlation is going to be important (even if my sites are
> fairly distant from one another, reducing the impact), but I do not know
> anything about spatial filtering. Whilst I have used distance matrices,
> I have never used them to remove spatial auto-correlation!
>
> Could you please point me to some resources?
>
> Best,
>
>
>
>
> On Thursday 11 December 2008 14:29:27 Kamran Safi wrote:
>> Hi all,
>>
>> I agree with Ashton. The issue is very complex and far from resolved.
>> But sometimes we have to go down the PCA path. Among the many possible
>> solutions is the broken stick approach, for which you find an R solution
>> (bstick()) in the package vegan. Conceptually, the broken stick model
>> divides 100% of the variance at random among your N principal components,
>> which gives a null expectation for the share of variance each component
>> would receive by chance. You then keep all those PCs whose observed
>> variance lies above the broken stick expectation. This is by no means an
>> agreed-upon approach, but it is at least reproducible and has some
>> theory behind it; it is and will remain a rule of thumb.
>> In terms of spatial analysis you could derive the PCs and then go into
>> classic spatial analysis. Although the interpretation of PCs is
>> sometimes complicated or even impossible, you can calculate their values
>> for every grid cell and then go into multivariate analysis, whereby you
>> have to take spatial autocorrelation into account. At least your
>> principal components are orthogonal, which simplifies your analysis in
>> contrast to using the original variables. It also allows you to produce
>> predictive models.
>> What you could think of doing is using PCA to derive "environmental"
>> variables which are uncorrelated and then use the distance matrix and
>> spatial filtering to "remove" spatial autocorrelation.
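The broken stick rule described above can be sketched in a few lines. This is a stand-alone Python illustration, not the vegan implementation; it assumes the usual closed-form expectation for the k-th largest of p randomly broken pieces, b_k = (1/p) * sum_{i=k..p} 1/i. The component variances are invented for the example.

```python
def broken_stick(p):
    """Expected share of the k-th largest piece (k = 1..p) when a unit
    stick is broken at random into p pieces."""
    return [sum(1.0 / i for i in range(k, p + 1)) / p for k in range(1, p + 1)]

# Hypothetical component variances (eigenvalues), largest first;
# not data from this thread.
variances = [18.0, 9.0, 4.5, 2.0, 1.0, 0.8, 0.7]

total = sum(variances)
observed = [v / total for v in variances]   # observed share per component
expected = broken_stick(len(variances))     # null expectation per component

# Keep the leading components for as long as the observed share of
# variance exceeds the broken stick expectation.
n_keep = 0
for obs, exp in zip(observed, expected):
    if obs <= exp:
        break
    n_keep += 1

for k, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(f"PC{k}: observed {obs:.3f}, broken stick {exp:.3f}")
print("components kept:", n_keep)
```

Stopping at the first component that falls below the expectation is one common reading of the rule; counting every component above its expectation, wherever it sits in the ordering, is another.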
>>
>> Hope this helps,
>>
>> Kami
>>
>>
>>
>> -----Original Message-----
>> From: r-sig-geo-bounces at stat.math.ethz.ch
>> [mailto:r-sig-geo-bounces at stat.math.ethz.ch] On Behalf Of Ashton
>> Shortridge
>> Sent: 11 December 2008 14:11
>> To: r-sig-geo at stat.math.ethz.ch
>> Cc: Corrado
>> Subject: Re: [R-sig-Geo] Principal Component Analysis -
>> Selecting components? + right choice?
>>
>> Hi Corrado,
>>
>> > I ran the PCA using prcomp, quite successfully. Now I need to use a
>> > criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?).
>> >
>> > What criteria would you suggest?
>>
>> that's an interesting and probably controversy-generating question. It's
>> probably not an R-sig-geo question, either. I am not a PCA person, but
>> the rule of thumb I am aware of is to plot the variability each
>> component 'explains' and look for a clear breakpoint. I would think
>> about any multivariate analysis text would have a better explanation
>> than I can give, though.
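A small sketch of the numbers behind such a plot, in plain Python. The variances are hypothetical (in R they would be the squared sdev values that prcomp reports), and the "largest drop" shown at the end is only a crude numerical stand-in for eyeballing the breakpoint, not an established criterion.

```python
# Hypothetical component variances (eigenvalues), largest first;
# not data from this thread.
variances = [18.0, 9.0, 4.5, 2.0, 1.0, 0.8, 0.7]

total = sum(variances)
explained = [v / total for v in variances]   # share of variance per component

# Cumulative share, which threshold-style rules of thumb work from.
cumulative = []
running = 0.0
for share in explained:
    running += share
    cumulative.append(running)

# Drop in explained share from each component to the next. A large drop
# followed by small ones is what a visible "breakpoint" looks like.
drops = [explained[i] - explained[i + 1] for i in range(len(explained) - 1)]
largest_drop_after = drops.index(max(drops)) + 1   # 1-based component index

for k, (share, cum) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{k}: {share:6.1%} explained, {cum:6.1%} cumulative")
print("largest drop comes after PC", largest_drop_after)
```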
>>
>> As for something more rigorous, I think a lot of people are reluctant to
>> use PCA as a modeling approach not so much because it's hard to choose a
>> threshold for selecting components, but because the interpretation of
>> the meaning of each component is pretty subjective. If you want an
>> explanatory model, be careful about using PCA. You would be better served
>> by deciding, based perhaps on expert knowledge about the variables, which
>> ones to use in the model and which ones not to.
>>
>> To try to make this a bit more spatial, and therefore more relevant to
>> the list, I will also warn you that your various climate variables are
>> almost certainly spatially autocorrelated - that is, neighboring and
>> nearby observations in the grid are not independent. That has serious
>> implications for standard multivariate analysis techniques and
>> diagnostics.
>>
>> Yours,
>>
>> Ashton
>>
>> On Thursday 11 December 2008 06:46:37 am Corrado wrote:
>> > Dear R gurus,
>> >
>> > I have some climatic data for a region of the world. They are monthly
>> > averages 1950-2000 of precipitation (12 months), minimum temperature
>> > (12 months), maximum temperature (12 months). I have scaled them to
>> > 2 km x 2 km cells, and I have around 75,000 cells.
>> >
>> > I need to feed them into a statistical model as co-variates, to use
>> > them to predict a response variable.
>> >
>> > The climatic data are obviously correlated: precipitation for January
>> > is correlated to precipitation for February and so on ... even
>> > precipitation and temperature are heavily correlated. I did some
>> > correlation analysis and they are all strongly correlated.
>> >
>> > I thought of running PCA on them, in order to reduce the number of
>> > co-variates I feed into the model.
>> >
>> > I ran the PCA using prcomp, quite successfully. Now I need to use a
>> > criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?).
>> >
>> > What criterion would you suggest?
>> >
>> > At the moment, I am using a criterion based on a threshold, but that is
>> > highly subjective, even if there are some rules of thumb (Jolliffe,
>> > Principal Component Analysis, 2nd Edition, Springer Verlag, 2002).
>> >
>> > Could you suggest something more rigorous?
>> >
>> > By the way, do you think I would have been better off using something
>> > different from PCA?
>> >
>> > Best,
>
>
>
> --
> Corrado Topi
>
> Global Climate Change & Biodiversity Indicators
> Area 18,Department of Biology
> University of York, York, YO10 5YW, UK
> Phone: + 44 (0) 1904 328645, E-mail: ct529 at york.ac.uk
>
>
> _______________________________________________
> R-sig-Geo mailing list
> R-sig-Geo at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>



