[R] scale or not to scale that is the question - prcomp

Petr PIKAL petr.pikal at precheza.cz
Thu Aug 20 08:06:49 CEST 2009


OIS

Thank you both for pointing me to it. I did not notice this because the 
unscaled position of the points was quite clear and straightforward given 
my knowledge of the data. The scaled plot is slightly more distorted and 
the relationships are not so obvious.

Thank you both

Petr Pikal
petr.pikal at precheza.cz
724008364, 581 252 256, 
581 252 140, 581 252 257


Kevin Wright <kw.stat at gmail.com> wrote on 19.08.2009 18:33:12:

> If you mentally rotate the second biplot by 90 degrees, the plots are not so
> different.  This just indicates that the 2nd and 3rd principal components
> have switched roles.
> 
> Kevin Wright
> 
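
The point about components switching roles can be checked by correlating the 
scores from the unscaled fit with the scores from the scaled fit. The sketch 
below uses made-up data (not rglp from this thread); the variable names, the 
helper noisy() and the block structure are assumptions chosen so that the 2nd 
and 3rd components swap when scale. = TRUE is used.

```r
# Synthetic data, assumed for illustration only (not the rglp data).
set.seed(42)
n  <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)

# Hypothetical helper: variable with standard deviation s that shares
# 90% of its variance with the common factor f.
noisy <- function(f, s) s * (sqrt(0.9) * f + sqrt(0.1) * rnorm(length(f)))

d <- data.frame(
  a1 = noisy(f1, 2), a2 = noisy(f1, 2), a3 = noisy(f1, 2),  # correlated block, large variance
  b1 = noisy(f2, 1), b2 = noisy(f2, 1),                     # correlated pair, small variance
  c1 = rnorm(n, sd = 2)                                     # lone variable, large variance
)

p_raw <- prcomp(d, scale. = FALSE)
p_sc  <- prcomp(d, scale. = TRUE)

# Absolute correlations between the first three score vectors of each fit.
# Rows: unscaled PCs, columns: scaled PCs. Signs are dropped because the
# sign of a principal component is arbitrary.
m <- round(abs(cor(p_raw$x[, 1:3], p_sc$x[, 1:3])), 2)
m
```

A large entry in row PC2 / column PC3 and in row PC3 / column PC2, with small 
values on the corresponding diagonal, means the two components have exchanged 
places rather than changed their meaning.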

> On Wed, Aug 19, 2009 at 10:09 AM, Petr PIKAL <petr.pikal at precheza.cz> wrote:
> Ok
> 
> Thank you for your time.
> 
> Best regards
> Petr Pikal
> 
> Duncan Murdoch <murdoch at stats.uwo.ca> wrote on 19.08.2009 16:29:07:
> 
> > On 8/19/2009 10:14 AM, Petr PIKAL wrote:
> > > Duncan Murdoch <murdoch at stats.uwo.ca> wrote on 19.08.2009 15:25:00:
> > >
> > >> On 19/08/2009 9:02 AM, Petr PIKAL wrote:
> > >> > Thank you
> > >> >
> > >> > Duncan Murdoch <murdoch at stats.uwo.ca> wrote on 19.08.2009 14:49:52:
> > >> >
> > >> >> On 19/08/2009 8:31 AM, Petr PIKAL wrote:
> > >> >>> Dear all
> > >> >>>
> > >> >
> > >> > <snip>
> > >> >
> > >> >> I would say the answer depends on the meaning of the variables.  In
> > >> >> the unusual case that they are measured in dimensionless units, it
> > >> >> might make sense not to scale.  But if you are using arbitrary units
> > >> >> of measurement, do you want your answer to depend on them?  For
> > >> >> example, if you change from kg to mg, the numbers will become much
> > >> >> larger, the variable will contribute much more variance, and it will
> > >> >> become a more important part of the largest principal component.  Is
> > >> >> that sensible?
> > >> >
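
The kg-to-mg argument above can be reproduced with a few lines. This is only a 
sketch on made-up data: the data frame d, the column names mass_kg, mass_mg, 
x1 and x2, and the sample size are assumptions, not anything from rglp.

```r
# Synthetic data, assumed for illustration only.
set.seed(1)
d <- data.frame(mass_kg = rnorm(50, mean = 2, sd = 0.5),
                x1 = rnorm(50),
                x2 = rnorm(50))

# Same measurements, with the mass expressed in mg instead of kg.
d_mg <- transform(d, mass_mg = mass_kg * 1e6)[, c("mass_mg", "x1", "x2")]

# Without scaling, the unit change lets the mass variable take over PC1.
p_kg <- prcomp(d,    scale. = FALSE)
p_mg <- prcomp(d_mg, scale. = FALSE)
abs(p_kg$rotation[, 1])
abs(p_mg$rotation[, 1])

# With scaling, the component standard deviations do not depend on the unit.
s_kg <- prcomp(d,    scale. = TRUE)
s_mg <- prcomp(d_mg, scale. = TRUE)
all.equal(s_kg$sdev, s_mg$sdev)
```

In the unscaled mg fit the first loading vector is essentially the mass 
variable alone, while the scaled fits agree whichever unit is used.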
> > >> > Basically the variables are percentages (all between 0 and 6%)
> > >> > except dus, which is present or not present (for the purpose of
> > >> > prcomp transformed to 0/1 by as.numeric :). The only variable that
> > >> > is not of this kind is iep, which is basically in the range 5-8. So
> > >> > the ranges of all variables are quite similar.
> > >> >
> > >> > What surprises me is that I can interpret the biplot without scaling
> > >> > by the variables used, while the biplot with scaling is totally
> > >> > different and the two pictures do not match at all. This is what
> > >> > surprised me, as I would have expected just a small difference
> > >> > between the results from the two settings, since all the numbers are
> > >> > quite comparable and do not differ much.
> > >>
> > >>
> > >> If you look at the standard deviations in the two cases, I think you
> > >> can see why this happens:
> > >>
> > >> Scaled:
> > >>
> > >> Standard deviations:
> > >> [1] 1.3335175 1.2311551 1.0583667 0.7258295 0.2429397
> > >>
> > >> Not Scaled:
> > >>
> > >> Standard deviations:
> > >> [1] 1.0030048 0.8400923 0.5679976 0.3845088 0.1531582
> > >>
> > >>
> > >> The first two sds are close, so small changes to the data will affect
> > >
> > > I see. But I would expect that the changes made to the data by scaling
> > > would not alter it in such a way that the unscaled and scaled results
> > > are completely different.
> > >
> > >> their direction a lot.  Your biplots look at the 2nd and 3rd components.
> > >
> > > Yes, because the grouping in the 2nd and 3rd component biplot can be
> > > easily explained by the values of some variables (without scaling).
> > >
> > > I must admit that I do not use prcomp very often, and usually scaling
> > > can give me an "explainable" result, especially if I use it for
> > > "variable reduction". Therefore I am reluctant to use it in this case.
> > >
> > > When I try a "more standard" way
> > >
> > >> fit<-lm(iep~sio2+al2o3+p2o5+as.numeric(dus), data=rglp)
> > >> summary(fit)
> > >
> > > Call:
> > > lm(formula = iep ~ sio2 + al2o3 + p2o5 + as.numeric(dus), data = rglp)
> > >
> > > Residuals:
> > >      Min       1Q   Median       3Q      Max
> > > -0.41751 -0.15568 -0.03613  0.20124  0.43046
> > >
> > > Coefficients:
> > >                 Estimate Std. Error t value Pr(>|t|)
> > > (Intercept)      7.12085    0.62257  11.438 8.24e-08 ***
> > > sio2            -0.67250    0.20953  -3.210 0.007498 **
> > > al2o3            0.40534    0.08641   4.691 0.000522 ***
> > > p2o5            -0.76909    0.11103  -6.927 1.59e-05 ***
> > > as.numeric(dus) -0.64020    0.18101  -3.537 0.004094 **
> > >
> > > I get quite a plausible result which can be interpreted without
> > > problems.
> > >
> > > My data is the result of a designed experiment (more or less :) and
> > > therefore all variables are significant. Is that the reason why scaling
> > > may be inappropriate in this case?
> >
> > No, I think it's just that the cloud of points is approximately
> > spherical in the first 2 or 3 principal components, so the principal
> > component directions are somewhat arbitrary.  You just got lucky that
> > the 2nd and 3rd components are interpretable:  I wouldn't put too much
> > faith in being able to repeat that if you went out and collected a new
> > set of data using the same design.
> >
> > Duncan Murdoch
> 
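
The remark that close standard deviations make the component directions 
somewhat arbitrary can be illustrated by drawing repeated independent samples 
and measuring how far the first principal axis moves between them. Everything 
in this sketch is assumed: the function name pc1_angle, the sample size of 30, 
and the two sets of population standard deviations are made up and are not the 
values reported earlier in the thread.

```r
set.seed(7)

# Hypothetical helper: angle in degrees between the first principal axes
# of two independent samples of independent normal variables whose
# population standard deviations are given in sds.
pc1_angle <- function(sds, n = 30) {
  draw <- function() sapply(sds, function(s) rnorm(n, sd = s))
  v1 <- prcomp(draw())$rotation[, 1]
  v2 <- prcomp(draw())$rotation[, 1]
  # abs() removes the arbitrary sign; min() guards against rounding above 1.
  acos(min(1, abs(sum(v1 * v2)))) * 180 / pi
}

close_sds <- replicate(200, pc1_angle(c(1.00, 0.95, 0.5)))  # first two sds nearly equal
far_sds   <- replicate(200, pc1_angle(c(3.00, 1.00, 0.5)))  # first sd well separated

median(close_sds)
median(far_sds)
```

When the leading standard deviations are nearly equal, the first axis swings 
widely from one sample to the next; when they are well separated it stays 
close to the same direction. That is the sense in which an interpretable 
component may not reappear in a new set of data from the same design.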
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



