[Rd] scale(x, center=FALSE) (PR#14219)
Ben Bolker
bolker at ufl.edu
Thu Feb 25 15:30:54 CET 2010
<mrizzo <at> bgsu.edu> writes:
> scale returns incorrect values when center=FALSE and scale=TRUE.
>
> When center=FALSE, scale=TRUE, the "scale" used is not
> the square root of sample
> variance, the "scale" attribute is equal to sqrt(sum(x^2)/(n-1)).
>
> Example:
>
> x <- runif(10)
> n <- length(x)
>
> scaled <- scale(x, center=FALSE, scale=TRUE)
> scaled
> s.bad <- attr(scaled, "scale")
> s.bad #wrong
> sd(x) #correct
>
> #compute the sd as if data has already been centered
> #that is, compute the variance as sum(x^2)/(n-1)
>
> sqrt(sum(x^2)/(n-1))
>
>
Are you sure this is a bug? I agree that the way the function
behaves is (to me) mildly confusing, but the documentation says:
* The value of ‘scale’ determines how column scaling is performed
* (after centering). If ‘scale’ is a numeric vector with length
* equal to the number of columns of ‘x’, then each column of ‘x’ is
* divided by the corresponding value from ‘scale’. If ‘scale’ is
* ‘TRUE’ then scaling is done by dividing the (centered) columns of
* ‘x’ by their standard deviations, and if ‘scale’ is ‘FALSE’, no
* scaling is done.
* The standard deviation for a column is obtained by computing the
* square-root of the sum-of-squares of the non-missing values in the
* column divided by the number of non-missing values minus one
* (whether or not centering was done).
If you read the first clause of the last sentence of the first
paragraph in isolation, you would expect the columns to be
scaled by sd(x). However, the second paragraph clearly states
that the 'standard deviation' is defined here as the square root
of the sum of squares divided by (n-1), that is,
sqrt(sum(x^2)/(n-1)), whether or not the column has been centered.
This does seem like a funny choice, but it is probably stuck
that way without an extremely compelling argument to the contrary.
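As a quick check of the documented behavior, here is a minimal sketch
(the seed and variable names are arbitrary choices of mine, not from the
report). It reads the attribute under its full name, "scaled:scale",
compares it with the formula from the help page, and shows that the same
formula agrees with sd() once the column has been centered:

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
x <- runif(10)
n <- length(x)

# No centering: the divisor is built from the raw sum of squares
uncentered <- scale(x, center = FALSE, scale = TRUE)
s_unc <- attr(uncentered, "scaled:scale")
rms   <- sqrt(sum(x^2) / (n - 1))  # formula described in ?scale

print(all.equal(s_unc, rms))           # matches the documented formula
print(isTRUE(all.equal(s_unc, sd(x)))) # does not match sd(x)

# With centering: the same formula applied to x - mean(x) is sd(x)
centered <- scale(x, center = TRUE, scale = TRUE)
s_cen <- attr(centered, "scaled:scale")
print(all.equal(s_cen, sd(x)))
```

The two results differ only because sum(x^2) is taken around zero in the
first case and around the column mean in the second.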
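If you want to scale the columns of a matrix x by sd() instead you can say
scale(x,center=FALSE,scale=apply(x,2,sd))
(for a plain vector, as in your example, apply() will not work, so
pass scale=sd(x) directly).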
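A small sketch of that workaround on a matrix, assuming a made-up
10 x 2 example (the matrix m and the seed are hypothetical, chosen only
for illustration). It checks that the supplied divisors are stored as
given and that each rescaled column ends up with a standard deviation of 1:

```r
set.seed(1)                               # arbitrary seed for reproducibility
m <- matrix(runif(20), ncol = 2)          # hypothetical 10 x 2 example matrix

col_sd <- apply(m, 2, sd)                 # per-column sample standard deviations
by_sd  <- scale(m, center = FALSE, scale = col_sd)

# The divisors passed in are recorded in the "scaled:scale" attribute
print(all.equal(attr(by_sd, "scaled:scale"), col_sd))

# Each column now has sd 1, although the column means are not zero
new_sd <- apply(by_sd, 2, sd)
print(new_sd)
print(colMeans(by_sd))
```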
Would you like to submit a patch for the documentation that
would preserve the sense, clarify the behavior, and not be
much longer than the current version ... ?
cheers
Ben Bolker