[R] scale or not to scale that is the question - prcomp
Duncan Murdoch
murdoch at stats.uwo.ca
Wed Aug 19 16:29:07 CEST 2009
On 8/19/2009 10:14 AM, Petr PIKAL wrote:
> Duncan Murdoch <murdoch at stats.uwo.ca> wrote on 19.08.2009 15:25:00:
>
>> On 19/08/2009 9:02 AM, Petr PIKAL wrote:
>> > Thank you
>> >
>> > Duncan Murdoch <murdoch at stats.uwo.ca> wrote on 19.08.2009 14:49:52:
>> >
>> >> On 19/08/2009 8:31 AM, Petr PIKAL wrote:
>> >>> Dear all
>> >>>
>> >
>> > <snip>
>> >
>> >> I would say the answer depends on the meaning of the variables. In the
>> >> unusual case that they are measured in dimensionless units, it might
>> >> make sense not to scale. But if you are using arbitrary units of
>> >> measurement, do you want your answer to depend on them? For example, if
>> >> you change from kg to mg, the numbers will become much larger, the
>> >> variable will contribute much more variance, and it will become a more
>> >> important part of the largest principal component. Is that sensible?
>> >
>> > Basically, the variables are percentages (all between 0 and 6%) except
>> > dus, which is either present or not present (for the purpose of prcomp
>> > it is transformed to 0/1 by as.numeric :). The only variable that is
>> > different is iep, which is basically in the range 5-8. So the ranges of
>> > all variables are quite similar.
>> >
>> > What surprises me is that I can interpret the biplot without scaling in
>> > terms of the variables used, while the biplot with scaling is totally
>> > different, and the two pictures do not match at all. I would have
>> > expected just a small difference between the results from the two
>> > settings, as all the numbers are quite comparable and do not differ much.
>>
>>
>> If you look at the standard deviations in the two cases, I think you can
>> see why this happens:
>>
>> Scaled:
>>
>> Standard deviations:
>> [1] 1.3335175 1.2311551 1.0583667 0.7258295 0.2429397
>>
>> Not Scaled:
>>
>> Standard deviations:
>> [1] 1.0030048 0.8400923 0.5679976 0.3845088 0.1531582
>>
>>
>> The first two sds are close, so small changes to the data will affect
>
> I see. But I would expect that changes to data made by scaling would not
> change it in such a way that unscaled and scaled results are completely
> different.
>
>> their direction a lot. Your biplots look at the 2nd and 3rd components.
>
> Yes, because the grouping in the biplot of the 2nd and 3rd components can
> easily be explained by the values of some variables (without scaling).
>
> I must admit that I do not use prcomp very often, and usually scaling
> gives me an "explainable" result, especially if I use it for "variable
> reduction". Therefore I am reluctant to use it in this case.
>
> When I try a "more standard" way
>
> > fit <- lm(iep ~ sio2 + al2o3 + p2o5 + as.numeric(dus), data = rglp)
> > summary(fit)
>
> Call:
> lm(formula = iep ~ sio2 + al2o3 + p2o5 + as.numeric(dus), data = rglp)
>
> Residuals:
>      Min       1Q   Median       3Q      Max
> -0.41751 -0.15568 -0.03613  0.20124  0.43046
>
> Coefficients:
>                 Estimate Std. Error t value Pr(>|t|)
> (Intercept)      7.12085    0.62257  11.438 8.24e-08 ***
> sio2            -0.67250    0.20953  -3.210 0.007498 **
> al2o3            0.40534    0.08641   4.691 0.000522 ***
> p2o5            -0.76909    0.11103  -6.927 1.59e-05 ***
> as.numeric(dus) -0.64020    0.18101  -3.537 0.004094 **
>
> I get a quite plausible result which can be interpreted without problems.
>
> My data is the result of a designed experiment (more or less :) and
> therefore all variables are significant. Is that the reason why scaling
> may be inappropriate in this case?
No, I think it's just that the cloud of points is approximately
spherical in the first 2 or 3 principal components, so the principal
component directions are somewhat arbitrary. You just got lucky that
the 2nd and 3rd components are interpretable: I wouldn't put too much
faith in being able to repeat that if you went out and collected a new
set of data using the same design.
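
A minimal sketch of the point about nearly spherical clouds, using simulated data rather than the rglp data from this thread. The sample size, the standard deviations, and the helper names (pc_angle, first_pc, median_angle) are all assumptions chosen for illustration. It compares how far the estimated first principal component strays from the true direction when the two largest standard deviations are close and when they are well separated.

```r
# Illustrative sketch only: simulated data, not the rglp data from this thread.
set.seed(1)

# Angle (in degrees) between two unit vectors, ignoring sign,
# since a principal component direction is only defined up to sign.
pc_angle <- function(u, v) {
  acos(min(1, abs(sum(u * v)))) * 180 / pi
}

# Draw n points whose coordinates are independent normals with the
# given sds, and return the estimated first principal component.
first_pc <- function(n, sds) {
  x <- sapply(sds, function(s) rnorm(n, sd = s))
  prcomp(x)$rotation[, 1]
}

# Median angle between the estimated PC1 and the true PC1 (the first
# coordinate axis, which has the largest sd) over repeated samples.
median_angle <- function(n, sds, reps = 200) {
  truth <- c(1, rep(0, length(sds) - 1))
  median(replicate(reps, pc_angle(first_pc(n, sds), truth)))
}

# Hypothetical settings: n = 20 observations, three variables.
close_sds <- median_angle(n = 20, sds = c(1.00, 0.95, 0.50))  # first two nearly equal
apart_sds <- median_angle(n = 20, sds = c(3.00, 0.95, 0.50))  # first clearly largest

round(c(close = close_sds, apart = apart_sds), 1)
```

If the first value comes out much larger than the second, that is the instability described above: when the leading standard deviations are close, a fresh sample from the same design can rotate the component directions substantially.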
Duncan Murdoch