[R] variance/mean

Tue Mar 24 00:39:58 CET 2009

(this post suggests a patch to the sources, so i allow myself to divert
it to r-devel)

Bert Gunter wrote:
> x a numeric vector, matrix or data frame. 
> y NULL (default) or a vector, matrix or data frame with compatible
> dimensions to x. The default is equivalent to y = x (but more efficient). 
>
>   
bert points to an interesting fragment of ?var:  it suggests that
computing var(x) is more efficient than computing var(x,x), for any x
valid as input to var.  indeed:

    set.seed(0)
    x = matrix(rnorm(10000), 100, 100)

    library(rbenchmark)
    benchmark(replications=1000, columns=c('test', 'elapsed'),
       var(x),
       var(x, x))
    #        test elapsed
    # 1    var(x)   1.091
    # 2 var(x, x)   2.051

that's of course, so to speak, unreasonable:  for what var(x) does is
actually computing the covariance of x and x, which should be the same
as var(x,x). 

the hack is that if y is given, there's an overhead of memory allocation
for *both* x and y when y is given, as seen in src/main/cov.c:720+.
incidentally, it seems that the problem can be solved with a trivial fix
(see the attached patch), so that

    set.seed(0)
    x = matrix(rnorm(10000), 100, 100)

    library(rbenchmark)
    benchmark(replications=1000, columns=c('test', 'elapsed'),
       var(x),
       var(x, x))
    #        test elapsed
    # 1    var(x)   1.121
    # 2 var(x, x)   1.107

with the quick checks

    all.equal(var(x), var(x, x))
    # TRUE

    all(var(x) == var(x, x))
    # TRUE

and for cor it seems to make cor(x,x) slightly faster than cor(x), while
originally it was twice slower:

    # original
    benchmark(replications=1000, columns=c('test', 'elapsed'),
       cor(x),
       cor(x, x))
    #        test elapsed
    # 1    cor(x)   1.196
    # 2 cor(x, x)   2.253

    # patched
    benchmark(replications=1000, columns=c('test', 'elapsed'),
       cor(x),
       cor(x, x))
    #        test elapsed
    # 1    cor(x)   1.207
    # 2 cor(x, x)   1.204

(there is a visible penalty due to an additional pointer test, but it's
10ms on 1000 replications with 10000 data points, which i think is
negligible.)

> This is as clear as I would know how to state. 

i believe bert is right.

however, with the above fix, this can now be rewritten as:

"
x: a numeric vector, matrix or data frame. 
y: a vector, matrix or data frame with dimensions compatible to those of x. 
By default, y = x. 
"

which, to my simple mind, is even more clear than what bert would know
how to state, and less likely to cause the sort of confusion that
originated this thread.

the attached patch suggests modifications to src/main/cov.c and
src/library/stats/man/cor.Rd.
it has been prepared and checked as follows:

    svn co https://svn.r-project.org/R/trunk trunk
    cd trunk
    # edited the sources
    svn diff > cov.diff
    svn revert -R src
    patch -p0 < cov.diff

    tools/rsync-recommended
    ./configure
    make
    make check
    bin/R
    # subsequent testing within R

if you happen to consider this patch for a commit, please be sure to
examine and test it carefully first.

vQ