[Rd] [R] variance/mean

Tue Mar 24 02:17:41 CET 2009

Doesn't Fortran still require that the arguments to
a function not alias each other (in whole or in part)?
I could imagine that var() might call into Fortran code
(BLAS or LAPACK).  Wouldn you want to chance erroneous
results  at a high optimization level to save a bit of
time in an unusual situation?

(I could also imagine someone changing the R interpreter
so that x and x[-length(x)] could share the same memory
block and that could cause Fortran aliasing problems as
well.)

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com  

> -----Original Message-----
> From: r-devel-bounces at r-project.org 
> [mailto:r-devel-bounces at r-project.org] On Behalf Of Wacek Kusnierczyk
> Sent: Monday, March 23, 2009 4:40 PM
> To: r-devel at r-project.org
> Cc: r-help at r-project.org; rkevinburton at charter.net; Bert Gunter
> Subject: Re: [Rd] [R] variance/mean
> 
> 
> (this post suggests a patch to the sources, so i allow myself 
> to divert
> it to r-devel)
> 
> Bert Gunter wrote:
> > x a numeric vector, matrix or data frame. 
> > y NULL (default) or a vector, matrix or data frame with compatible
> > dimensions to x. The default is equivalent to y = x (but 
> more efficient). 
> >
> >   
> bert points to an interesting fragment of ?var:  it suggests that
> computing var(x) is more efficient than computing var(x,x), for any x
> valid as input to var.  indeed:
> 
>     set.seed(0)
>     x = matrix(rnorm(10000), 100, 100)
> 
>     library(rbenchmark)
>     benchmark(replications=1000, columns=c('test', 'elapsed'),
>        var(x),
>        var(x, x))
>     #        test elapsed
>     # 1    var(x)   1.091
>     # 2 var(x, x)   2.051
> 
> that's of course, so to speak, unreasonable:  for what var(x) does is
> actually computing the covariance of x and x, which should be the same
> as var(x,x). 
> 
> the hack is that if y is given, there's an overhead of memory 
> allocation
> for *both* x and y when y is given, as seen in src/main/cov.c:720+.
> incidentally, it seems that the problem can be solved with a 
> trivial fix
> (see the attached patch), so that
> 
>     set.seed(0)
>     x = matrix(rnorm(10000), 100, 100)
> 
>     library(rbenchmark)
>     benchmark(replications=1000, columns=c('test', 'elapsed'),
>        var(x),
>        var(x, x))
>     #        test elapsed
>     # 1    var(x)   1.121
>     # 2 var(x, x)   1.107
> 
> with the quick checks
> 
>     all.equal(var(x), var(x, x))
>     # TRUE
>    
>     all(var(x) == var(x, x))
>     # TRUE
> 
> and for cor it seems to make cor(x,x) slightly faster than 
> cor(x), while
> originally it was twice slower:
> 
>     # original
>     benchmark(replications=1000, columns=c('test', 'elapsed'),
>        cor(x),
>        cor(x, x))
>     #        test elapsed
>     # 1    cor(x)   1.196
>     # 2 cor(x, x)   2.253
>    
>     # patched
>     benchmark(replications=1000, columns=c('test', 'elapsed'),
>        cor(x),
>        cor(x, x))
>     #        test elapsed
>     # 1    cor(x)   1.207
>     # 2 cor(x, x)   1.204
> 
> (there is a visible penalty due to an additional pointer 
> test, but it's
> 10ms on 1000 replications with 10000 data points, which i think is
> negligible.)
> 
> > This is as clear as I would know how to state. 
> 
> i believe bert is right.
> 
> however, with the above fix, this can now be rewritten as:
> 
> "
> x: a numeric vector, matrix or data frame. 
> y: a vector, matrix or data frame with dimensions compatible 
> to those of x. 
> By default, y = x. 
> "
> 
> which, to my simple mind, is even more clear than what bert would know
> how to state, and less likely to cause the sort of confusion that
> originated this thread.
> 
> the attached patch suggests modifications to src/main/cov.c and
> src/library/stats/man/cor.Rd.
> it has been prepared and checked as follows:
> 
>     svn co https://svn.r-project.org/R/trunk trunk
>     cd trunk
>     # edited the sources
>     svn diff > cov.diff
>     svn revert -R src
>     patch -p0 < cov.diff
> 
>     tools/rsync-recommended
>     ./configure
>     make
>     make check
>     bin/R
>     # subsequent testing within R
> 
> if you happen to consider this patch for a commit, please be sure to
> examine and test it carefully first.
> 
> vQ
>