[R] Box-Cox Transformation: Drastic differences when varying added constants

Sun May 16 19:01:26 CEST 2010

On 2010-05-16 6:22, Holger Steinmetz wrote:
>
> Dear experts,
>
> I tried to learn about the Box-Cox transformation but found the following:
>
> When I had to add a constant to make all values of the original variable
> positive, I found that the lambda estimates (box.cox.powers function)
> differed dramatically depending on the specific constant chosen.

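Let's say that x is such that 1/x has a Normal distribution,
i.e. lambda = -1.
Then y = (1/x) + b also has a Normal distribution, since adding
a constant after the transformation only shifts the mean.
But you're adding the constant before the transformation, so
you're expecting 1/(x+b) to also have a Normal distribution.
It generally won't, so the best lambda for x+b differs from
the best lambda for x.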
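
To make this concrete, here is a minimal sketch in base R. It does not use
box.cox.powers; instead it maximizes the usual Box-Cox profile log-likelihood
directly with optimize(). The helper names (boxcox_loglik, estimate_lambda),
the simulated data, the shift of 1, and the search interval are all
assumptions chosen for illustration.

```r
# Profile log-likelihood of the Box-Cox parameter for a positive sample y
# (hypothetical helper, written out so the example needs no extra packages)
boxcox_loglik <- function(lambda, y) {
  n <- length(y)
  yt <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  sigma2 <- mean((yt - mean(yt))^2)
  -n / 2 * log(sigma2) + (lambda - 1) * sum(log(y))
}

# Maximize the profile log-likelihood over an assumed search interval
estimate_lambda <- function(y, interval = c(-10, 5)) {
  optimize(boxcox_loglik, interval = interval, y = y, maximum = TRUE)$maximum
}

set.seed(1)
z <- rnorm(2000, mean = 1, sd = 0.15)  # z = 1/x is Normal (and positive here)
x <- 1 / z                             # so lambda = -1 normalizes x

lambda_raw   <- estimate_lambda(x)      # should land near -1
lambda_shift <- estimate_lambda(x + 1)  # same data, constant added first

print(c(raw = lambda_raw, shifted = lambda_shift))
```

The estimate for x + 1 should come out clearly more negative than the one
for x, even though only a constant was added. The added constant is in effect
a second parameter of the transformation, not a harmless preprocessing step.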

>
> In addition, the correlation between the transformed variable and the
> original was not 1 (as I think it should be to use the transformed variable
> meaningfully) but much lower.

Again, your expectation is faulty. The relationship between the
original and transformed variables is not linear (otherwise,
why do the transformation?), but cor() computes the Pearson
correlation coefficient by default. Try method='spearman'.
Better yet, plot the transformed variables vs the original
variable for further enlightenment.
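
A small sketch of the difference, using made-up data (the exponential sample
and the exponents 0.25 and -0.25 are assumptions for illustration, not your
values):

```r
set.seed(2)
x <- rexp(200) + 0.1        # hypothetical right-skewed, strictly positive sample
y_pos <- x^0.25             # monotone increasing power transform
y_neg <- x^(-0.25)          # monotone decreasing: a raw negative power flips the order

# Pearson measures linear association, so a curved relationship gives |r| < 1
pearson_pos <- cor(x, y_pos)
pearson_neg <- cor(x, y_neg)

# Spearman works on ranks, so any strictly monotone transform gives +1 or -1
spearman_pos <- cor(x, y_pos, method = "spearman")
spearman_neg <- cor(x, y_neg, method = "spearman")

print(round(c(pearson_pos = pearson_pos, spearman_pos = spearman_pos,
              pearson_neg = pearson_neg, spearman_neg = spearman_neg), 2))

# The plot shows the curvature that Pearson's r penalizes
plot(x, y_pos, xlab = "original", ylab = "transformed")
```

Note that a raw power x^lambda with negative lambda is a decreasing function
of x, which by itself produces a negative correlation. The Box-Cox form
(x^lambda - 1)/lambda is increasing in x for every lambda, so it preserves the
ordering of the original values.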

  -Peter Ehlers

>
> With higher added values (and a right skewed variable) the lambda estimate
> was even negative and the correlation between the transformed variable and
> the original variable was -.91!!?
>
> I guess there is something fundamental missing in my current thinking about
> Box-Cox...
>
> Best,
> Holger
>
>
> P.S. Here is what I did:
>
> # box.cox.powers() comes from the car package
> library(car)
>
> # Creating a skewed variable X (mixture of two normals)
> x1 = rnorm(120,0,.5)
> x2 = rnorm(40,2.5,2)
> X = c(x1,x2)
>
> # Adding a small constant
> Xnew1 = X + abs(min(X)) + .1
> box.cox.powers(Xnew1)
> Xtrans1 = Xnew1^.2682 # (the value of the lambda estimate)
>
> # Adding a larger constant
> Xnew2 = X + abs(min(X)) + 1
> box.cox.powers(Xnew2)
> Xtrans2 = Xnew2^(-.2543) # (the value of the lambda estimate)
>
> #Plotting it all
> par(mfrow=c(3,2))
> hist(X)
> qqnorm(X)
> qqline(X,lty=2)
> hist(Xtrans1)
> qqnorm(Xtrans1)
> qqline(Xtrans1,lty=2)
> hist(Xtrans2)
> qqnorm(Xtrans2)
> qqline(Xtrans2,lty=2)
>
> # Correlation among original and transformed variables
> round(cor(cbind(X,Xtrans1,Xtrans2)),2)
