[Rd] [R] computing the variance

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Mon Dec 5 20:25:42 CET 2005


On 05-Dec-05 Martin Maechler wrote:
>     UweL> x <- c(1,2,3,4,5)
>     UweL> n <- length(x)
>     UweL> var(x)*(n-1)/n
> 
>     UweL> if you really want it.
> 
> It seems Insightful at some point in time have given in to
> this user request, and S-plus nowadays has
> an argument  "unbiased = TRUE"
> where the user can choose {to shoot (him/her)self in the leg and}
> require 'unbiased = FALSE'.
> {and there's also 'SumSquares = FALSE', which allows one to
> skip the division (by N or N-1) altogether}
> 
> Since in some ``schools of statistics'' people are really still
> taught to use a 1/N variance, we could envisage to provide such an
> argument to var() {and cov()} as well.  Otherwise, people define
> their own variance function such as  
>       VAR <- function(x, ...) (N-1)/N * var(x, ...)
> Should we?

If people need to do this, such an option would be a convenience,
but I don't see that it has much further merit than that.
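For what it's worth, the user-side wrapper Martin sketches above is easy enough to write out in full; a sketch (the name VAR is purely illustrative):

```r
## Sketch of a user-defined 1/N variance wrapper (illustrative only):
## scale R's default 1/(n-1) var() down by (n-1)/n.
VAR <- function(x, ...) {
  n <- length(x)
  (n - 1) / n * var(x, ...)
}

x <- c(1, 2, 3, 4, 5)
var(x)  # 2.5 -- R's default, divisor n-1
VAR(x)  # 2   -- divisor n
```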

My view of how to calculate a "variance" is based not directly
on the "unbiased" issue, but on the following.

Suppose you define a RV X as a single value sampled from a finite
population of values X1,...,XN.

The variance of X is (or damn well should be) defined as

  Var(X) = E(X^2) - (E(X))^2

and this comes to (Sum(X^2) - (Sum(X))^2/N)/(N-1).

So this is the variance of the set of values {X1,...,XN} from
that point of view. Similarly I like to preserve the analogy
by calling the "variance" of a set of sampled values {x1,...,xn}
the quantity calculated by dividing by (n-1).
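As a quick numerical sanity check (plain base R, with made-up data), the computational form above does reproduce var():

```r
## Verify: (Sum(x^2) - (Sum(x))^2/n) / (n - 1) equals R's var(x).
x <- c(2.1, 3.5, 5.0, 7.2, 8.8)
n <- length(x)
v <- (sum(x^2) - sum(x)^2 / n) / (n - 1)
all.equal(v, var(x))  # TRUE
```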

This of course links with "unbiased" in that when you use the
"1/(n-1)" definition you do have an unbiased estimator of Var(X).

And it ties in nicely with what I call "The Fundamental Formula
of the Analysis of Variance" (coincidentally mentioned recently
on R-help):

  Var[X](X) = E[J](Var[X|J](X|J)) + Var[J](E[X|J](X|J))

for two random variables X, J (where "E[J](...)", "E[X|J](...)"
mean "Expectation using the distribution of J or the conditional
distribution of X|J respectively").

Now, for instance, take a set of samples of sizes n1,...,nk
indexed by j=1,...,k and pick a value X at random from the
N = n1+...+nk sample values. This gives rise also to a random
value of J (sample index) with P(J=j) = nj/N. Now apply the
FFAOV and you get the usual breakdown of the sum of squares.
And so on.
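The construction just described can be checked numerically; a small R sketch (sample sizes and values are made up), using the 1/N form of variance throughout, since that is the distribution of the randomly picked X:

```r
## Pick X uniformly from the pooled values of k samples; then
## Var(X) = E_J[Var(X|J)] + Var_J[E(X|J)], with P(J = j) = n_j/N.
set.seed(1)
samples <- list(rnorm(4, 0), rnorm(6, 2), rnorm(5, 5))
x  <- unlist(samples)
pj <- sapply(samples, length) / length(x)   # P(J = j) = n_j / N
vN <- function(z) mean((z - mean(z))^2)     # 1/n ("population") variance
within  <- sum(pj * sapply(samples, vN))                  # E_J[ Var(X|J) ]
between <- sum(pj * (sapply(samples, mean) - mean(x))^2)  # Var_J[ E(X|J) ]
all.equal(vN(x), within + between)  # TRUE
```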

All that sort of thing is too fundamental to be undermined by
making more than the minimum necessary concession (i.e. a
convenience for those who must) to the "1/n" view of the matter.

I think the confusion (and it does exist) arises (a) from the
historical notion that "variance" = the mean squared deviation from
the mean (which I prefer to name by its description); together
with (b) the fact that the Mean, Variance and such occur very early
on in even the least mathematical of introductory statistics courses,
and participants in these are invariably puzzled by the somewhat
mysterious appearance on stage of "1/(n-1)" -- it can be easier
to sweep this under the carpet by saying "it doesn't make any
difference in large samples" etc.

> BTW: S+ even has the 'unbiased' argument for cor(), where of course
> it really doesn't make any difference (!), and actually I think it
> is rather misleading, since the sample correlation is not unbiased
> in almost all cases AFAICS.

Agreed. I wasn't aware of this -- what is the S-plus default?
(Not that it matters ...) Simply a silly distinction, and
possibly a carry-over from the "confusion" described above
(e.g. "Since the sample correlation is calculated as
 Cov(X,Y)/sqrt(Var(X)*Var(Y)), should we use the unbiased
or the biased versions of the Cov and the Vars?").
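Indeed, the n/(n-1) factor cancels in the ratio, so the question answers itself; a quick check in R (data made up):

```r
## The (n-1)/n factor cancels between numerator and denominator,
## so "biased" and "unbiased" ingredients give the same correlation.
x <- c(1.2, 3.4, 2.2, 5.1, 4.0)
y <- c(2.0, 3.9, 2.5, 6.2, 4.4)
n <- length(x)
f <- (n - 1) / n   # converts the 1/(n-1) forms to 1/n forms
r1 <- cov(x, y) / sqrt(var(x) * var(y))                # unbiased ingredients
r2 <- (f * cov(x, y)) / sqrt(f * var(x) * f * var(y))  # biased ingredients
all.equal(r1, r2)  # TRUE -- and both equal cor(x, y)
```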

Hmmm.
Best wishes,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 05-Dec-05                                       Time: 19:19:20
------------------------------ XFMail ------------------------------


