[Rd] [R] variance/mean

Tue Mar 24 02:25:55 CET 2009

Oops, I was thinking backwards.  This sort of
hack could avoid the Fortran aliasing rules, not
run afoul of them.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com  

> -----Original Message-----
> From: r-devel-bounces at r-project.org 
> [mailto:r-devel-bounces at r-project.org] On Behalf Of William Dunlap
> Sent: Monday, March 23, 2009 6:18 PM
> To: Wacek Kusnierczyk; r-devel at r-project.org
> Subject: Re: [Rd] [R] variance/mean
> 
> Doesn't Fortran still require that the arguments to
> a function not alias each other (in whole or in part)?
> I could imagine that var() might call into Fortran code
> (BLAS or LAPACK).  Wouldn you want to chance erroneous
> results  at a high optimization level to save a bit of
> time in an unusual situation?
> 
> (I could also imagine someone changing the R interpreter
> so that x and x[-length(x)] could share the same memory
> block and that could cause Fortran aliasing problems as
> well.)
> 
> Bill Dunlap
> TIBCO Software Inc - Spotfire Division
> wdunlap tibco.com  
> 
> > -----Original Message-----
> > From: r-devel-bounces at r-project.org 
> > [mailto:r-devel-bounces at r-project.org] On Behalf Of Wacek 
> Kusnierczyk
> > Sent: Monday, March 23, 2009 4:40 PM
> > To: r-devel at r-project.org
> > Cc: r-help at r-project.org; rkevinburton at charter.net; Bert Gunter
> > Subject: Re: [Rd] [R] variance/mean
> > 
> > 
> > (this post suggests a patch to the sources, so i allow myself 
> > to divert
> > it to r-devel)
> > 
> > Bert Gunter wrote:
> > > x a numeric vector, matrix or data frame. 
> > > y NULL (default) or a vector, matrix or data frame with compatible
> > > dimensions to x. The default is equivalent to y = x (but 
> > more efficient). 
> > >
> > >   
> > bert points to an interesting fragment of ?var:  it suggests that
> > computing var(x) is more efficient than computing var(x,x), 
> for any x
> > valid as input to var.  indeed:
> > 
> >     set.seed(0)
> >     x = matrix(rnorm(10000), 100, 100)
> > 
> >     library(rbenchmark)
> >     benchmark(replications=1000, columns=c('test', 'elapsed'),
> >        var(x),
> >        var(x, x))
> >     #        test elapsed
> >     # 1    var(x)   1.091
> >     # 2 var(x, x)   2.051
> > 
> > that's of course, so to speak, unreasonable:  for what 
> var(x) does is
> > actually computing the covariance of x and x, which should 
> be the same
> > as var(x,x). 
> > 
> > the hack is that if y is given, there's an overhead of memory 
> > allocation
> > for *both* x and y when y is given, as seen in src/main/cov.c:720+.
> > incidentally, it seems that the problem can be solved with a 
> > trivial fix
> > (see the attached patch), so that
> > 
> >     set.seed(0)
> >     x = matrix(rnorm(10000), 100, 100)
> > 
> >     library(rbenchmark)
> >     benchmark(replications=1000, columns=c('test', 'elapsed'),
> >        var(x),
> >        var(x, x))
> >     #        test elapsed
> >     # 1    var(x)   1.121
> >     # 2 var(x, x)   1.107
> > 
> > with the quick checks
> > 
> >     all.equal(var(x), var(x, x))
> >     # TRUE
> >    
> >     all(var(x) == var(x, x))
> >     # TRUE
> > 
> > and for cor it seems to make cor(x,x) slightly faster than 
> > cor(x), while
> > originally it was twice slower:
> > 
> >     # original
> >     benchmark(replications=1000, columns=c('test', 'elapsed'),
> >        cor(x),
> >        cor(x, x))
> >     #        test elapsed
> >     # 1    cor(x)   1.196
> >     # 2 cor(x, x)   2.253
> >    
> >     # patched
> >     benchmark(replications=1000, columns=c('test', 'elapsed'),
> >        cor(x),
> >        cor(x, x))
> >     #        test elapsed
> >     # 1    cor(x)   1.207
> >     # 2 cor(x, x)   1.204
> > 
> > (there is a visible penalty due to an additional pointer 
> > test, but it's
> > 10ms on 1000 replications with 10000 data points, which i think is
> > negligible.)
> > 
> > > This is as clear as I would know how to state. 
> > 
> > i believe bert is right.
> > 
> > however, with the above fix, this can now be rewritten as:
> > 
> > "
> > x: a numeric vector, matrix or data frame. 
> > y: a vector, matrix or data frame with dimensions compatible 
> > to those of x. 
> > By default, y = x. 
> > "
> > 
> > which, to my simple mind, is even more clear than what bert 
> would know
> > how to state, and less likely to cause the sort of confusion that
> > originated this thread.
> > 
> > the attached patch suggests modifications to src/main/cov.c and
> > src/library/stats/man/cor.Rd.
> > it has been prepared and checked as follows:
> > 
> >     svn co https://svn.r-project.org/R/trunk trunk
> >     cd trunk
> >     # edited the sources
> >     svn diff > cov.diff
> >     svn revert -R src
> >     patch -p0 < cov.diff
> > 
> >     tools/rsync-recommended
> >     ./configure
> >     make
> >     make check
> >     bin/R
> >     # subsequent testing within R
> > 
> > if you happen to consider this patch for a commit, please be sure to
> > examine and test it carefully first.
> > 
> > vQ
> > 
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>