[R] Fast ave for sorted data?

Sun Feb 15 20:08:37 CET 2009

On Sun, 15 Feb 2009, Zhou Fang wrote:

> Hi,
>
> This is probably really obvious, but I can't seem to find anything on it.
>
> Is there a fast version of ave for when the data is already sorted in terms 
> of the factor, or if the breaks are already known?
>

If all you want are means, you can use rle() and colMeans() to good 
effect:

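foo2 <-
function (x, y)
{
    ## x must be sorted, so that equal values form contiguous runs
    reps <- rle(x)$lengths       # length of each run of equal x values
    lens <- rep(reps, reps)      # run length, repeated for every element
    uniqLens <- unique(lens)
    ## runs of length 1 are already their own mean, so skip them
    for (i in uniqLens[ uniqLens != 1 ]) {
        ## one column per run of length i; colMeans() gives the run means
        y[ lens == i ] <-
            rep( colMeans(matrix(y[ lens == i ], nrow = i)), each = i)
    }
    y
}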
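
To make the steps inside foo2() concrete, here is a small walk-through on the toy x from the original question. The last three y values are made up for illustration, since the question elides them.

```r
## x from the question; the last three y values are hypothetical
x <- c(0.1, 0.2, 0.32, 0.32, 0.4, 0.56, 0.56, 0.7)
y <- c(223, 434, 343, 544, 231, 100, 300, 50)

reps <- rle(x)$lengths      # 1 1 2 1 2 1
lens <- rep(reps, reps)     # 1 1 2 2 1 2 2 1

## the only run length above 1 is 2: lay those y out one run per column
m <- matrix(y[lens == 2], nrow = 2)   # columns (343, 544) and (100, 300)
y[lens == 2] <- rep(colMeans(m), each = 2)
y                           # 223 434 443.5 443.5 231 200 200 50
```

Because all runs of the same length are handled in one matrix, the loop runs once per distinct run length rather than once per group.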

> x <- sort( round( runif(100000, 0 , 1 ), 5) )
> y <- sample(1000000,100000)
> all.equal(ave(y,x),foo2(x,y))
[1] TRUE
> system.time(foo2(x,y))
    user  system elapsed
   0.087   0.029   0.117
> system.time(ave(y,x))
    user  system elapsed
   1.933   0.030   1.980
>

If, as in your example, a substantial fraction of the X's are unique, and 
if you want to generalize to more than means, then you can still gain a 
lot by treating the unique and non-unique values separately like this:

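foo <-
function (x, y)
{
    ## x must be sorted, so that rle() sees each group as one run
    reps <- rle(x)$lengths
    ## TRUE for elements whose x value occurs more than once
    len.not.1 <- rep(reps, reps) != 1
    ## only the repeated values need ave(); singletons stay as they are
    y[ len.not.1 ] <- ave( y[ len.not.1 ], x[ len.not.1 ])
    y
}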

> y <- sample(1000000,100000)
> x <- sort( round( runif(100000, 0 , 2 ), 5) )
> system.time(foo(x,y))
    user  system elapsed
   0.577   0.027   0.628
> system.time(ave(y,x))
    user  system elapsed
   2.513   0.038   2.545
> table(table(x))

     1     2     3     4     5     6
60526 15161  2578   318    28     1
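
The same split carries over to summaries other than the mean through the FUN argument of ave(). What follows is a minimal sketch, assuming the summary of a single value is that value itself (true for mean, median, min and max, but not for something like sd or length). The name `foo3` is hypothetical and used only for illustration.

```r
## Hypothetical generalization of foo(): apply FUN only to the repeated x
## values and leave singletons untouched. Only valid when FUN(v) equals v
## for a single value v. x is assumed to be sorted.
foo3 <- function(x, y, FUN = mean) {
    reps <- rle(x)$lengths
    len.not.1 <- rep(reps, reps) != 1
    y[len.not.1] <- ave(y[len.not.1], x[len.not.1], FUN = FUN)
    y
}

x <- sort(round(runif(1000, 0, 2), 2))
y <- sample(1000000, 1000)
all.equal(foo3(x, y, median), ave(y, x, FUN = median))
```

FUN has to be passed by name to ave(), because it comes after the `...` that collects the grouping variables.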

And if neither of these is quite good enough, a line or two of C code 
should do the trick. See package 'inline'.
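
Short of C, and since x stays fixed while y changes in the original question, the run structure can also be computed once and reused for each new y. The following is only a sketch under that assumption: `make_run_mean` is a hypothetical helper name, not part of ave() or any package, and x must already be sorted. It replaces the per-group calls with a single cumsum() over y.

```r
## Hypothetical helper: precompute the run structure of a sorted x once,
## then return a function that averages any y of the same length over it.
make_run_mean <- function(x) {
    reps <- rle(x)$lengths     # run lengths of equal x values
    ends <- cumsum(reps)       # index of the last element of each run
    function(y) {
        stopifnot(length(y) == length(x))
        cs <- cumsum(as.numeric(y))     # as.numeric() avoids integer overflow
        sums <- diff(c(0, cs[ends]))    # per-run sums from the running total
        rep(sums / reps, reps)          # per-run means, expanded to full length
    }
}

x <- sort(round(runif(1000, 0, 1), 2))
run_mean <- make_run_mean(x)            # done once, while x stays fixed
y <- sample(1000000, 1000)
all.equal(run_mean(y), ave(y, x))       # run_mean() can be called for each new y
```

One caveat with this design: taking differences of a long running total can lose a little precision for floating-point y, so it is worth checking against ave() on representative data first.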

HTH,

Chuck

> Basically, I have:
> X = 0.1, 0.2, 0.32, 0.32, 0.4, 0.56, 0.56, 0.7...
> Y = 223, 434, 343, 544, 231.... etc
> of the same, admittedly large length.
>
> Now note that some of the values of X are repeated. What I want to do is, for 
> those X that are repeated, take the corresponding values of Y and change them 
> to the average for that particular X.
>
> So, ave(Y,X) will work. But it's very slow, and certainly not suited to my 
> problem, where Y changes and X stays the same and I need to repeatedly 
> recalculate the averaging of Y. ave() also does not take advantage of the 
> sorting of the data.
>
> So, is there an alternative? (Presumably avoiding loops.)
>
> Thanks,
>
> Zhou Fang
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901