[R] variance/mean
Wacek Kusnierczyk
Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Tue Mar 24 00:39:58 CET 2009
(this post suggests a patch to the sources, so i allow myself to divert
it to r-devel)
Bert Gunter wrote:
> x a numeric vector, matrix or data frame.
> y NULL (default) or a vector, matrix or data frame with compatible
> dimensions to x. The default is equivalent to y = x (but more efficient).
>
>
bert points to an interesting fragment of ?var: it suggests that
computing var(x) is more efficient than computing var(x,x), for any x
valid as input to var. indeed:
set.seed(0)
x = matrix(rnorm(10000), 100, 100)
library(rbenchmark)
benchmark(replications=1000, columns=c('test', 'elapsed'),
var(x),
var(x, x))
# test elapsed
# 1 var(x) 1.091
# 2 var(x, x) 2.051
that's of course, so to speak, unreasonable: for what var(x) does is
actually computing the covariance of x and x, which should be the same
as var(x,x).
the hack is that if y is given, there's an overhead of memory allocation
for *both* x and y when y is given, as seen in src/main/cov.c:720+.
incidentally, it seems that the problem can be solved with a trivial fix
(see the attached patch), so that
set.seed(0)
x = matrix(rnorm(10000), 100, 100)
library(rbenchmark)
benchmark(replications=1000, columns=c('test', 'elapsed'),
var(x),
var(x, x))
# test elapsed
# 1 var(x) 1.121
# 2 var(x, x) 1.107
with the quick checks
all.equal(var(x), var(x, x))
# TRUE
all(var(x) == var(x, x))
# TRUE
and for cor it seems to make cor(x,x) slightly faster than cor(x), while
originally it was twice slower:
# original
benchmark(replications=1000, columns=c('test', 'elapsed'),
cor(x),
cor(x, x))
# test elapsed
# 1 cor(x) 1.196
# 2 cor(x, x) 2.253
# patched
benchmark(replications=1000, columns=c('test', 'elapsed'),
cor(x),
cor(x, x))
# test elapsed
# 1 cor(x) 1.207
# 2 cor(x, x) 1.204
(there is a visible penalty due to an additional pointer test, but it's
10ms on 1000 replications with 10000 data points, which i think is
negligible.)
> This is as clear as I would know how to state.
i believe bert is right.
however, with the above fix, this can now be rewritten as:
"
x: a numeric vector, matrix or data frame.
y: a vector, matrix or data frame with dimensions compatible to those of x.
By default, y = x.
"
which, to my simple mind, is even more clear than what bert would know
how to state, and less likely to cause the sort of confusion that
originated this thread.
the attached patch suggests modifications to src/main/cov.c and
src/library/stats/man/cor.Rd.
it has been prepared and checked as follows:
svn co https://svn.r-project.org/R/trunk trunk
cd trunk
# edited the sources
svn diff > cov.diff
svn revert -R src
patch -p0 < cov.diff
tools/rsync-recommended
./configure
make
make check
bin/R
# subsequent testing within R
if you happen to consider this patch for a commit, please be sure to
examine and test it carefully first.
vQ
More information about the R-help
mailing list