[R-sig-eco] PCA as a predictive model

Bob O'Hara bohara at senckenberg.de
Wed May 23 11:15:31 CEST 2012


On 05/23/2012 10:55 AM, Marc Taylor wrote:
> Hi Jari - one more question if you don't mind. Since the weights of the PCs
> are related to the amount of variance that they explain in the original
> data - is it problematic to predict the PC scores with a second data set
> that has a different amount of variance (e.g. due to differing number of
> samples)? In both the 1st and 2nd data sets I have been using scaled values
> for the variables (mean=0 and sd=1 for each sample).
> Cheers,
> Marc
I'll pretend to be Jari for a moment. :-)

PCA just scales and rotates the data in cunning ways, so with the new
data you need to scale and rotate it in the same way. If you scale the
new values by their own mean and sd first, then you've already changed
the scaling relative to the old data.
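
To make "scales and rotates" concrete, here is a minimal sketch. The toy data and the names X, X.new, pc and manual are made up for illustration. It checks that predict() on a princomp fit amounts to subtracting the training means and multiplying by the loadings:

```r
set.seed(1)
# Toy training data and new data (hypothetical, for illustration only)
X     <- matrix(rnorm(200), ncol = 4)
X.new <- matrix(rnorm(40, mean = 2, sd = 3), ncol = 4)

pc <- princomp(X)

# Scores by hand: centre with the *training* means, then rotate
manual <- sweep(X.new, 2, pc$center) %*% unclass(pc$loadings)

# The two sets of scores should agree up to floating point error
all.equal(unname(predict(pc, X.new)), unname(manual))
```

Nothing about X.new (its mean, its variance, its sample size) enters the transformation; only the centre and loadings estimated from X do.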

What you need to do is either do PCA on the raw data or scale the new
data using the means and standard deviations of the old data.

library(MASS)

NVar=5; NObs=50
Sigma=matrix(c(
   10, 0.2,   0,   0, 0.4,
  0.2,   5, 0.1,   0, 0.6,
    0, 0.1, 1.0, 0.2,   0,
    0,   0, 0.2,   5,   0,
  0.4, 0.6,   0,   0,   1), nrow=5)

# simulate data
Data=mvrnorm(NObs, rnorm(NVar), Sigma=Sigma)
# Do PCA on scaled data
Data.Sc=scale(Data)
PC=princomp(Data.Sc)

# Simulate new data
NewData=mvrnorm(10, rnorm(NVar), Sigma=Sigma)
# Get PC scores for the new data. First do it wrong (scaled by its own mean and sd)...
PC.wrong=predict(PC, newdata=scale(NewData))

# Now scale correctly

NewData.Sc=scale(NewData, center=attr(Data.Sc, "scaled:center"),
                 scale=attr(Data.Sc, "scaled:scale"))
PC.right=predict(PC, newdata=NewData.Sc)
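
The other route, PCA on the raw data, could be sketched with prcomp(), which stores the centring and scaling it applied and reuses them in predict(). This is only a sketch under assumptions: the toy data and the names Old, New, pc.raw, scores.new and by.hand are made up, and it uses prcomp() rather than the princomp() of the example above.

```r
set.seed(2)
# Toy data (hypothetical): 50 old and 10 new observations on 5 variables,
# with column j having standard deviation j
Old <- matrix(rnorm(250, sd = rep(1:5, each = 50)), ncol = 5)
New <- matrix(rnorm(50,  sd = rep(1:5, each = 10)), ncol = 5)

# PCA on the raw data; prcomp() does the scaling and remembers it
pc.raw <- prcomp(Old, center = TRUE, scale. = TRUE)

# predict() applies the stored means and sds of Old to New
scores.new <- predict(pc.raw, newdata = New)

# Equivalent by hand: scale New with Old's means and sds, then rotate
by.hand <- scale(New, center = pc.raw$center, scale = pc.raw$scale) %*%
  pc.raw$rotation
all.equal(unname(scores.new), unname(by.hand))
```

With this approach there is no separate scaled copy of the data to keep in sync, so the new data cannot accidentally be standardised by its own mean and sd.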

HTH

Bob

-- 

Bob O'Hara

Biodiversity and Climate Research Centre
Senckenberganlage 25
D-60325 Frankfurt am Main,
Germany

Tel: +49 69 798 40226
Mobile: +49 1515 888 5440
WWW:   http://www.bik-f.de/root/index.php?page_id=219
Blog: http://blogs.nature.com/boboh
Journal of Negative Results - EEB: www.jnr-eeb.org
