[Rd] [R] variance/mean
Martin Maechler
maechler at stat.math.ethz.ch
Wed Apr 1 10:09:33 CEST 2009
>>>>> Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>
>>>>> on Tue, 24 Mar 2009 00:39:58 +0100 writes:
> (this post suggests a patch to the sources, so i allow myself to divert
> it to r-devel)
> Bert Gunter wrote:
>> x a numeric vector, matrix or data frame.
>> y NULL (default) or a vector, matrix or data frame with compatible
>> dimensions to x. The default is equivalent to y = x (but more efficient).
>>
>>
> bert points to an interesting fragment of ?var: it suggests that
> computing var(x) is more efficient than computing var(x,x), for any x
> valid as input to var. indeed:
> set.seed(0)
> x = matrix(rnorm(10000), 100, 100)
> library(rbenchmark)
> benchmark(replications=1000, columns=c('test', 'elapsed'),
> var(x),
> var(x, x))
> # test elapsed
> # 1 var(x) 1.091
> # 2 var(x, x) 2.051
> that's of course, so to speak, unreasonable: for what var(x) does is
> actually computing the covariance of x and x, which should be the same
> as var(x,x).
> the hack is that if y is given, there's an overhead of memory allocation
> for *both* x and y when y is given, as seen in src/main/cov.c:720+.
> incidentally, it seems that the problem can be solved with a trivial fix
> (see the attached patch), so that
> set.seed(0)
> x = matrix(rnorm(10000), 100, 100)
> library(rbenchmark)
> benchmark(replications=1000, columns=c('test', 'elapsed'),
> var(x),
> var(x, x))
> # test elapsed
> # 1 var(x) 1.121
> # 2 var(x, x) 1.107
> with the quick checks
> all.equal(var(x), var(x, x))
> # TRUE
> all(var(x) == var(x, x))
> # TRUE
> and for cor it seems to make cor(x,x) slightly faster than cor(x), while
> originally it was twice slower:
> # original
> benchmark(replications=1000, columns=c('test', 'elapsed'),
> cor(x),
> cor(x, x))
> # test elapsed
> # 1 cor(x) 1.196
> # 2 cor(x, x) 2.253
> # patched
> benchmark(replications=1000, columns=c('test', 'elapsed'),
> cor(x),
> cor(x, x))
> # test elapsed
> # 1 cor(x) 1.207
> # 2 cor(x, x) 1.204
> (there is a visible penalty due to an additional pointer test, but it's
> 10ms on 1000 replications with 10000 data points, which i think is
> negligible.)
>> This is as clear as I would know how to state.
> i believe bert is right.
> however, with the above fix, this can now be rewritten as:
> "
> x: a numeric vector, matrix or data frame.
> y: a vector, matrix or data frame with dimensions compatible to those of x.
> By default, y = x.
> "
> which, to my simple mind, is even more clear than what bert would know
> how to state, and less likely to cause the sort of confusion that
> originated this thread.
Your patch is basically only affecting the default
method = "pearson". For (most) other cases, 'y = NULL' would
still remain *the* way to save computations, unless we'd start
to use an R-level equivalent [which I think does not exist] of
your C trick (DATAPTR(x) == DATAPTR(y)).
Also, for S- and R- backcompatibility reasons, we'd need to
continue allowing y = NULL (as your patch would, too), so
currently I think this whole idea -- as slick as it is, I
learned something! --
does not make sense applying here.
> the attached patch suggests modifications to src/main/cov.c and
> src/library/stats/man/cor.Rd.
BTW: since you didn't (and shouldn't , because of method != "pearson" !)
change the R code, the docs \usage{.} part should not have been
changed either !
and as I mentioned: using 'y = NULL' in the function call must
continue to work, hence should also be documented as
possibility
==> the docs would not really become more clear, I think
Martin Maechler, ETH Zurich
> it has been prepared and checked as follows:
> svn co https://svn.r-project.org/R/trunk trunk
> cd trunk
> # edited the sources
> svn diff > cov.diff
> svn revert -R src
> patch -p0 < cov.diff
> tools/rsync-recommended
> ./configure
> make
> make check
> bin/R
> # subsequent testing within R
> if you happen to consider this patch for a commit, please be sure to
> examine and test it carefully first.
> vQ
> Content-Type: text/x-diff; name="cov.diff"
> Content-Disposition: inline; filename="cov.diff"
> Content-ID: <18899.7024.520234.153929 at lynne.math.ethz.ch>
> Content-Transfer-Encoding: binary
> [Deleted text/x-diff]
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
More information about the R-devel
mailing list