[Rd] [R] variance/mean
William Dunlap
wdunlap at tibco.com
Tue Mar 24 02:17:41 CET 2009
Doesn't Fortran still require that the arguments to
a function not alias each other (in whole or in part)?
I could imagine that var() might call into Fortran code
(BLAS or LAPACK). Wouldn you want to chance erroneous
results at a high optimization level to save a bit of
time in an unusual situation?
(I could also imagine someone changing the R interpreter
so that x and x[-length(x)] could share the same memory
block and that could cause Fortran aliasing problems as
well.)
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
> -----Original Message-----
> From: r-devel-bounces at r-project.org
> [mailto:r-devel-bounces at r-project.org] On Behalf Of Wacek Kusnierczyk
> Sent: Monday, March 23, 2009 4:40 PM
> To: r-devel at r-project.org
> Cc: r-help at r-project.org; rkevinburton at charter.net; Bert Gunter
> Subject: Re: [Rd] [R] variance/mean
>
>
> (this post suggests a patch to the sources, so i allow myself
> to divert
> it to r-devel)
>
> Bert Gunter wrote:
> > x a numeric vector, matrix or data frame.
> > y NULL (default) or a vector, matrix or data frame with compatible
> > dimensions to x. The default is equivalent to y = x (but
> more efficient).
> >
> >
> bert points to an interesting fragment of ?var: it suggests that
> computing var(x) is more efficient than computing var(x,x), for any x
> valid as input to var. indeed:
>
> set.seed(0)
> x = matrix(rnorm(10000), 100, 100)
>
> library(rbenchmark)
> benchmark(replications=1000, columns=c('test', 'elapsed'),
> var(x),
> var(x, x))
> # test elapsed
> # 1 var(x) 1.091
> # 2 var(x, x) 2.051
>
> that's of course, so to speak, unreasonable: for what var(x) does is
> actually computing the covariance of x and x, which should be the same
> as var(x,x).
>
> the hack is that if y is given, there's an overhead of memory
> allocation
> for *both* x and y when y is given, as seen in src/main/cov.c:720+.
> incidentally, it seems that the problem can be solved with a
> trivial fix
> (see the attached patch), so that
>
> set.seed(0)
> x = matrix(rnorm(10000), 100, 100)
>
> library(rbenchmark)
> benchmark(replications=1000, columns=c('test', 'elapsed'),
> var(x),
> var(x, x))
> # test elapsed
> # 1 var(x) 1.121
> # 2 var(x, x) 1.107
>
> with the quick checks
>
> all.equal(var(x), var(x, x))
> # TRUE
>
> all(var(x) == var(x, x))
> # TRUE
>
> and for cor it seems to make cor(x,x) slightly faster than
> cor(x), while
> originally it was twice slower:
>
> # original
> benchmark(replications=1000, columns=c('test', 'elapsed'),
> cor(x),
> cor(x, x))
> # test elapsed
> # 1 cor(x) 1.196
> # 2 cor(x, x) 2.253
>
> # patched
> benchmark(replications=1000, columns=c('test', 'elapsed'),
> cor(x),
> cor(x, x))
> # test elapsed
> # 1 cor(x) 1.207
> # 2 cor(x, x) 1.204
>
> (there is a visible penalty due to an additional pointer
> test, but it's
> 10ms on 1000 replications with 10000 data points, which i think is
> negligible.)
>
> > This is as clear as I would know how to state.
>
> i believe bert is right.
>
> however, with the above fix, this can now be rewritten as:
>
> "
> x: a numeric vector, matrix or data frame.
> y: a vector, matrix or data frame with dimensions compatible
> to those of x.
> By default, y = x.
> "
>
> which, to my simple mind, is even more clear than what bert would know
> how to state, and less likely to cause the sort of confusion that
> originated this thread.
>
> the attached patch suggests modifications to src/main/cov.c and
> src/library/stats/man/cor.Rd.
> it has been prepared and checked as follows:
>
> svn co https://svn.r-project.org/R/trunk trunk
> cd trunk
> # edited the sources
> svn diff > cov.diff
> svn revert -R src
> patch -p0 < cov.diff
>
> tools/rsync-recommended
> ./configure
> make
> make check
> bin/R
> # subsequent testing within R
>
> if you happen to consider this patch for a commit, please be sure to
> examine and test it carefully first.
>
> vQ
>
More information about the R-devel
mailing list