[R] uniq -c

William Dunlap wdunlap at tibco.com
Wed Oct 17 23:11:29 CEST 2012


Note that the relative speeds of these methods, which all use essentially the same
run-length-encoding algorithm, depend on the nature of the dataset.  I made a million-row
data.frame with 10,000 unique users, 26 unique countries, and 6 unique languages, giving
about 3/4 million unique rows.  The times for methods 1, 2, and 3 were then 0.7, 6.2, and
10.5 seconds, respectively.  With a million-row data.frame with 100, 10, and 4 unique users,
countries, and languages, respectively, and 4000 unique rows, the times were 0.3, 1.4, and
0.7 seconds.
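
The construction of the test data is not shown above, so the following is only a sketch of one plausible way to build a comparable data.frame and time a function on it. It assumes each column is sampled independently and uniformly and that the rows are then sorted so identical rows are adjacent. The names make.test.data, n.runs, and time.it are hypothetical helpers, not part of the original post, and the row count is reduced so the example runs quickly.

```r
# Hypothetical helper (not from the original post): build a sorted test
# data.frame by sampling each column independently and uniformly.
# Assumes n.countries <= 26 and n.languages <= 26 (labels come from LETTERS/letters).
make.test.data <- function(n, n.users, n.countries, n.languages) {
  x <- data.frame(
    user     = sample.int(n.users, n, replace = TRUE),
    country  = sample(LETTERS[seq_len(n.countries)], n, replace = TRUE),
    language = sample(letters[seq_len(n.languages)], n, replace = TRUE),
    stringsAsFactors = FALSE
  )
  # Sort so that identical rows are adjacent, which is what "uniq -c" counts.
  x <- x[order(x$user, x$country, x$language), ]
  rownames(x) <- NULL
  x
}

# Number of runs of identical adjacent rows (a paste/rle shortcut that
# relies on the values containing no spaces).
n.runs <- function(x) length(rle(do.call(paste, x))$lengths)

# Elapsed wall-clock seconds for applying f to x.
time.it <- function(f, x) system.time(f(x))[["elapsed"]]

set.seed(1)
# 1e5 rows instead of a million, with the "few unique rows" configuration.
x <- make.test.data(1e5, 100, 10, 4)
elapsed <- time.it(n.runs, x)
```

Any of the row.count functions from the quoted summary can be passed to time.it in place of n.runs; changing the last three arguments of make.test.data varies how many distinct rows, and hence how many runs, the data contain.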

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of Sam Steingold
> Sent: Wednesday, October 17, 2012 12:58 PM
> To: r-help at r-project.org
> Subject: Re: [R] uniq -c
> 
> > * Sam Steingold <fqf at tah.bet> [2012-10-16 11:03:27 -0400]:
> >
> > I need an analogue of "uniq -c" for a data frame.
> 
> Summary of options:
> 
> 1. William:
> 
> isFirstInRun <- function(x) UseMethod("isFirstInRun")
> isFirstInRun.default <- function(x) c(TRUE, x[-1] != x[-length(x)])
> isFirstInRun.data.frame <- function(x) {
>   stopifnot(ncol(x)>0)
>   retval <- isFirstInRun(x[[1]])
>   for(column in x) {
>     retval <- retval | isFirstInRun(column)
>   }
>   retval
> }
> row.count.1 <- function (x) {
>   i <- which(isFirstInRun(x))
>   data.frame(x[i,], count=diff(c(i, 1L+nrow(x))))
> }
> 
> 147 seconds
> 
> 2. http://orgmode.org/worg/org-contrib/babel/examples/Rpackage.html#sec-6-1
> row.count.2 <- function (x) {
>   equal.to.previous <- rowSums( x[2:nrow(x),] != x[1:(nrow(x)-1),] )==0
>   tf.runs <- rle(equal.to.previous)
>   counts <- c(1, unlist(mapply(function(x,y) if (y) x+1 else (rep(1,x)),
>                                tf.runs$lengths, tf.runs$values)))
>   counts <- counts[ c( diff( counts ) <= 0, TRUE ) ]
>   unique.rows <- which( c(TRUE, !equal.to.previous ) )
>   cbind(x[ unique.rows, ,drop=FALSE ], counts)
> }
> 
> 136 seconds
> 
> 3. Micael: paste/strsplit
> 
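> row.count.3 <- function (x) {
>   # collapse each row into one space-separated key
>   p <- do.call(paste,x)
>   # run-length encode the keys: one entry per run of identical rows
>   rl <- rle(p)
>   # split each key back into its three fields
>   sp <- strsplit(as.character(rl$values)," ")
>   data.frame(user = sapply(sp,"[",1),
>              country = sapply(sp,"[",2),
>              language = sapply(sp,"[",3),
>              count = rl$lengths)
> }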
> 
> Here I know the columns and rely on the absence of spaces in the values.
> 
> 27 seconds.
> 
> Thanks to all who answered.
> 
> --
> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
> http://www.childpsy.net/ http://www.PetitionOnline.com/tap12009/
> http://thereligionofpeace.com http://ffii.org http://camera.org
> A slave dreams not of Freedom, but of owning his own slaves.
> 



