[R-sig-Geo] Principal Component Analysis - Selecting components? + right choice?

Stéphane Dray dray at biomserv.univ-lyon1.fr
Fri Dec 12 09:19:21 CET 2008


The multispati function in ade4 implements a method which is very
similar to the SFA of Switzer et al. It finds linear combinations of the
variables that maximize the product of variance and spatial
autocorrelation, whereas PCA maximizes only the first term and SFA only
the second. The relations between SFA and multispati are described in
the paper:

*S. Dray*, S. Saïd, and F. Débias. Spatial ordination of vegetation data 
using a generalization of Wartenberg's multivariate spatial correlation. 
/Journal of Vegetation Science/, 19:45-56, 2008.
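
To make the "variance times autocorrelation" criterion concrete, here is
a minimal sketch in base R. It does not call multispati itself; it only
illustrates the quantity described above under stated assumptions: a
hypothetical 5 x 5 grid of sites, three simulated variables, rook
neighbours with row-standardised weights, uniform site weights, and
Moran's I as the autocorrelation measure. All object names (coords, X,
W, H, criterion) are made up for the illustration.

```r
set.seed(1)

# Hypothetical data: a 5 x 5 grid of sites and three simulated variables
side   <- 5
coords <- expand.grid(x = 1:side, y = 1:side)
n      <- nrow(coords)
X <- cbind(trend  = coords$x + coords$y + rnorm(n, sd = 0.5),  # smooth gradient
           noise1 = rnorm(n),                                  # no spatial structure
           noise2 = rnorm(n))
X <- scale(X)  # centre and standardise the columns

# Assumed spatial weights: rook neighbours (distance 1), row-standardised
d <- as.matrix(dist(coords))
W <- (abs(d - 1) < 1e-8) * 1
W <- W / rowSums(W)

# Variance, Moran's I and their product for the scores z = X a
criterion <- function(a) {
  z <- drop(X %*% a)                        # centred, because X is centred
  v <- sum(z^2) / n                         # variance of the scores
  I <- sum(z * drop(W %*% z)) / sum(z^2)    # Moran's I (row-standardised W)
  c(variance = v, moran = I, product = v * I)
}

# PCA axis 1: maximises the variance only
pca   <- prcomp(X)
a_pca <- pca$rotation[, 1]

# Axis maximising variance x Moran's I: the product equals a' H a with
# H = X' (W + W') X / (2n), so the leading eigenvector of H maximises it
H         <- t(X) %*% (W + t(W)) %*% X / (2 * n)
eig       <- eigen(H, symmetric = TRUE)
a_spatial <- eig$vectors[, 1]

print(rbind(pca = criterion(a_pca), spatial = criterion(a_spatial)))
```

The first row can never have a smaller variance than the second, and the
second row can never have a smaller product than the first; the leading
eigenvalue of H is the value of that product.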

Edzer Pebesma wrote:
> Principal components don't search for directions that best explain 
> your dependent variable, but rather try to capture variability and/or 
> correlation in the predictors. Methods that look for subspaces that 
> best predict the dependent variable are, for instance, partial least 
> squares and ridge regression. Using them, you could with the same 
> number of degrees of freedom look at completely different directions.
>
> In addition to Ashton's remark: a variety of principal components that 
> tries to pick up spatially correlated patterns in addition to maximum 
> variability/correlation between variables is MNF (minimum noise 
> fraction) factors, also called min/max autocorrelation factors; see 
> the papers by Green, Switzer and others. I'm not aware of 
> implementations of them in R, but would be interested to hear of any.
>
> Best regards,
> -- 
> Edzer
>
> Ashton Shortridge wrote:
>> Hi Corrado,
>>
>>  
>>> I ran the PCA using prcomp, quite successfully. Now I need to use a
>>> criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?)
>>>
>>> What criteria would you suggest?
>>>     
>>
>> that's an interesting and probably controversy-generating question. 
>> It's probably not an R-sig-geo question, either. I am not a PCA 
>> person, but the rule of thumb I am aware of is to plot the 
>> variability each component 'explains' and look for a clear 
>> breakpoint. I would think just about any multivariate analysis text 
>> would have a better explanation than I can give, though.
>>
>> As for something more rigorous, I think a lot of people are reluctant 
>> to use PCA as a modeling approach not so much because it's hard to 
>> choose a threshold for selecting components, but because the 
>> interpretation of the meaning of each component is pretty subjective. 
>> If you want an explanatory model, be careful about using PCA. You 
>> would be better served by deciding, based perhaps on expert knowledge 
>> about the variables, which ones to use in the model and which ones 
>> not to.
>>
>> To try to make this a bit more spatial, and therefore more relevant 
>> to the list, I will also warn you that your various climate variables 
>> are almost certainly spatially autocorrelated - that is, neighboring 
>> and nearby observations in the grid are not independent. That has 
>> serious implications for standard multivariate analysis techniques 
>> and diagnostics.
>>
>> Yours,
>>
>> Ashton
>>
>> On Thursday 11 December 2008 06:46:37 am Corrado wrote:
>>  
>>> Dear R gurus,
>>>
>>> I have some climatic data for a region of the world. They are monthly
>>> averages 1950-2000 of precipitation (12 months), minimum temperature
>>> (12 months), maximum temperature (12 months). I have scaled them to
>>> 2 km x 2 km cells, and I have around 75,000 cells.
>>>
>>> I need to feed them into a statistical model as co-variates, to use
>>> them to predict a response variable.
>>>
>>> The climatic data are obviously correlated: precipitation for January
>>> is correlated to precipitation for February and so on ... even
>>> precipitation and temperature are heavily correlated. I did some
>>> correlation analysis and they are all strongly correlated.
>>>
>>> I thought of running PCA on them, in order to reduce the number of
>>> co-variates I feed into the model.
>>>
>>> I ran the PCA using prcomp, quite successfully. Now I need to use a
>>> criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?)
>>>
>>> What criteria would you suggest?
>>>
>>> At the moment, I am using a criterion based on a threshold, but that
>>> is highly subjective, even if there are some rules of thumb (Jolliffe,
>>> Principal Component Analysis, 2nd edition, Springer-Verlag, 2002).
>>>
>>> Could you suggest something more rigorous?
>>>
>>> By the way, do you think I would have been better off using
>>> something different from PCA?
>>>
>>> Best,

-- 
Stéphane DRAY (dray at biomserv.univ-lyon1.fr )
Laboratoire BBE-CNRS-UMR-5558, Univ. C. Bernard - Lyon I
43, Bd du 11 Novembre 1918, 69622 Villeurbanne Cedex, France
Tel: 33 4 72 43 27 57       Fax: 33 4 72 43 13 88
http://biomserv.univ-lyon1.fr/~dray/



