[R-sig-eco] PCA as a predictive model
Bob O'Hara
bohara at senckenberg.de
Wed May 23 11:15:31 CEST 2012
On 05/23/2012 10:55 AM, Marc Taylor wrote:
> Hi Jari - one more question if you don't mind. Since the weights of the PCs
> are related to the amount of variance that they explain in the original
> data - is it problematic to predict the PC scores with a second data set
> that has a different amount of variance (e.g. due to differing number of
> samples)? In both the 1st and 2nd data sets I have been using scaled values
> for the variables (mean=0 and sd=1 for each sample).
> Cheers,
> Marc
I'll pretend to be Jari for a moment. :-)
PCA just scales and rotates the data in cunning ways, so with the new
data you need to scale and rotate it in the same way. If you scale the
new values by their own means and sds first, then you've already applied
a different scaling from the one used for the original data.
What you need to do is either do the PCA on the raw data or scale the new
data using the means and standard deviations of the old data.
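To see why the choice of scaling matters, here is a minimal sketch on toy
data (the names old, new, new.own and new.ref are made up for illustration).
The new samples are deliberately shifted by 10 units; scaling them by their
own statistics wipes out that shift, whereas scaling with the old centre and
sd keeps it.

```r
# Toy data, assumed purely for illustration
set.seed(1)
old <- matrix(rnorm(40), ncol = 2)   # "training" data, 20 samples
new <- old[1:5, ] + 10               # new samples, shifted by +10

old.sc <- scale(old)

# Scaled by its own means and sds: the shift disappears
new.own <- scale(new)

# Scaled by the old means and sds: the shift is preserved
new.ref <- scale(new,
                 center = attr(old.sc, "scaled:center"),
                 scale  = attr(old.sc, "scaled:scale"))

colMeans(new.own)   # essentially zero
colMeans(new.ref)   # clearly positive
```

Only the second version puts the new samples on the same axes as the old
ones, which is what the rotation from the PCA expects.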
library(MASS)
NVar=5; NObs=50
Sigma=matrix(c(
10,0.2, 0, 0,0.4,
0.2, 5,0.1, 0,0.6,
0,0.1,1.0, 0.2,0,
0, 0,0.2, 5, 0,
0.4,0.6, 0, 0,1), nrow=5)
# simulate data
Data=mvrnorm(NObs, rnorm(NVar), Sigma=Sigma)
# Do PCA on scaled data
Data.Sc=scale(Data)
PC=princomp(Data.Sc)
# Simulate new data
NewData=mvrnorm(10, rnorm(NVar), Sigma=Sigma)
# Predict PC scores for the new data. First do it wrong
# (the new data are scaled by their own means and sds)...
PC.wrong=predict(PC, newdata=scale(NewData))
# Now scale correctly
NewData.Sc=scale(NewData, center=attr(Data.Sc, "scaled:center"),
scale=attr(Data.Sc, "scaled:scale"))
PC.right=predict(PC, newdata=NewData.Sc)
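As a rough check of the "scale and rotate" description, the sketch below
reproduces the predicted scores by hand on a small simulated data set. It
uses base R only, and the object names (old, new, pc, scores.manual) are
assumptions for illustration, not part of the code above.

```r
set.seed(42)
old <- matrix(rnorm(50 * 3), ncol = 3)            # original data
new <- matrix(rnorm(10 * 3, mean = 2), ncol = 3)  # new data

# PCA on the scaled original data
old.sc <- scale(old)
pc <- princomp(old.sc)

# Scale the new data with the OLD means and sds
new.sc <- scale(new,
                center = attr(old.sc, "scaled:center"),
                scale  = attr(old.sc, "scaled:scale"))

# Scores from predict() ...
scores.predict <- predict(pc, newdata = new.sc)

# ... and by hand: subtract the centre stored in the fit, then rotate
scores.manual <- sweep(new.sc, 2, pc$center) %*% unclass(pc$loadings)

all.equal(unname(scores.predict), unname(scores.manual))
```

Since princomp() was fitted without cor=TRUE here, the scale stored in the
fit is 1 for every variable, so centring and multiplying by the loadings is
all that predict() has left to do.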
HTH
Bob
--
Bob O'Hara
Biodiversity and Climate Research Centre
Senckenberganlage 25
D-60325 Frankfurt am Main,
Germany
Tel: +49 69 798 40226
Mobile: +49 1515 888 5440
WWW: http://www.bik-f.de/root/index.php?page_id=219
Blog: http://blogs.nature.com/boboh
Journal of Negative Results - EEB: www.jnr-eeb.org