[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?
Kevin B. Hendricks
kevin.hendricks at sympatico.ca
Mon Jul 31 23:19:52 CEST 2006
Hi Thomas,
Here is a comparison of performance times for my own igroupSums
versus split()/lapply() and versus rowsum():
> x <- rnorm(2e6)
> i <- rep(1:1e6,2)
>
> unix.time(suma <- unlist(lapply(split(x,i),sum)))
[1] 8.188 0.076 8.263 0.000 0.000
>
> names(suma)<- NULL
>
> unix.time(sumb <- igroupSums(x,i))
[1] 0.036 0.000 0.035 0.000 0.000
>
> all.equal(suma, sumb)
[1] TRUE
>
> unix.time(sumc <- rowsum(x,i))
[1] 0.744 0.000 0.742 0.000 0.000
>
> sumc <- sumc[,1]
> names(sumc)<-NULL
> all.equal(suma,sumc)
[1] TRUE
So my implementation of igroupSums is faster and already handles NAs.
I also have implemented igroupMins, igroupMaxs, igroupAnys,
igroupAlls, igroupCounts, igroupMeans, and igroupRanges.
The igroup functions I implemented do not handle weights yet but do
handle NAs properly.
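For reference, the NA handling in something like igroupMeans has the
semantics of this rough base-R sketch (illustrative only, with made-up
names; not the actual implementation):

    ## Grouped means as NA-skipping grouped sums over grouped counts.
    ## Illustrative sketch; groups whose values are all NA drop out.
    igroupMeans_sketch <- function(x, i) {
      ok <- !is.na(x)
      sums   <- rowsum(x[ok], i[ok])[, 1]
      counts <- rowsum(rep(1, sum(ok)), i[ok])[, 1]
      sums / counts
    }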
Assuming I clean them up, is anyone in the R developer group interested?
Or would you rather I instead extend the rowsum approach to create
rowcount, rowmax, rowmin, etc. using a hash-function approach?
All of these approaches simply use different ways to map group
codes to integers and then apply the functions in the same way.
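Roughly, the shared pattern looks like this sketch (plain R standing
in for the real inner loop, with made-up names):

    ## Shared pattern: map group codes to consecutive integers,
    ## then accumulate each group's result in a single pass.
    groupsum_sketch <- function(x, i) {
      levs <- unique(i)
      g <- match(i, levs)           # group codes -> 1..k
      out <- numeric(length(levs))
      for (k in seq_along(x))       # this loop is what a C version speeds up
        out[g[k]] <- out[g[k]] + x[k]
      names(out) <- levs
      out
    }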
Thanks,
Kevin