[R] uniq -c

William Dunlap wdunlap at tibco.com
Thu Oct 18 00:23:17 CEST 2012


In addition, adding a factor method for isFirstInRun speeds it up on
long factor variables by c. 60%.

isFirstInRun.factor <- function(x)isFirstInRun(as.integer(x))
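
A minimal sketch of how one might check that the factor method agrees with the default method and compare their timings. The generic and default method are copied from earlier in the thread; the test factor `f` and its size are made up for illustration, and timings will differ by machine and data.

```r
# Generic and default method, as posted earlier in the thread.
isFirstInRun <- function(x) UseMethod("isFirstInRun")
isFirstInRun.default <- function(x) c(TRUE, x[-1] != x[-length(x)])
# Factor method: compare the underlying integer codes instead of the factor.
isFirstInRun.factor <- function(x) isFirstInRun(as.integer(x))

# Hypothetical test data: a long factor made of short runs of repeated levels.
set.seed(1)
f <- factor(rep(sample(letters, 2e5, replace = TRUE), each = 5))

# Both routes should flag exactly the same run starts.
via.codes  <- isFirstInRun(f)                        # dispatches to the factor method
via.labels <- isFirstInRun.default(as.character(f))  # compares the labels directly
stopifnot(identical(via.codes, via.labels))

# Timings depend on machine and data, so no expected numbers are given here.
system.time(isFirstInRun.default(f))  # default method applied to the factor itself
system.time(isFirstInRun(f))          # factor method via integer codes
```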

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of William Dunlap
> Sent: Wednesday, October 17, 2012 2:11 PM
> To: sds at gnu.org; r-help at r-project.org
> Subject: Re: [R] uniq -c
> 
> Note that the relative speeds of these, which all use basically the same
> run-length-encoding algorithm, depend on the nature of the dataset.  I made
> a million row data.frame with 10,000 unique users, 26 unique countries, and
> 6 unique languages with c. 3/4 million unique rows.  Then the times for
> methods 1, 2, and 3 were 0.7, 6.2, and 10.5 seconds, respectively.  With a
> million row data.frame with 100, 10, and 4 unique users, countries, and
> languages, with 4000 unique rows, the times were 0.3, 1.4, and 0.7.
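
The message does not show how the test data were built. The following is a rough sketch under an assumption: rows are sampled independently and then sorted so that identical rows are adjacent. The helpers `make.test.data` and `count.runs` are hypothetical names, not code from the thread. Under that assumption the first data set should have somewhere near three quarters of a million runs and the second at most 4000.

```r
# Hypothetical helper: n rows sampled independently, then sorted so that
# identical rows end up adjacent. The sorting step is an assumption.
make.test.data <- function(n, n.users, n.countries, n.languages) {
  d <- data.frame(user     = sample.int(n.users, n, replace = TRUE),
                  country  = sample.int(n.countries, n, replace = TRUE),
                  language = sample.int(n.languages, n, replace = TRUE))
  d <- d[order(d$user, d$country, d$language), ]
  rownames(d) <- NULL
  d
}

# Count runs of identical adjacent rows. Once the data are sorted this
# equals the number of unique rows.
count.runs <- function(d) {
  n <- nrow(d)
  changed <- rep(FALSE, n - 1L)
  for (column in d) changed <- changed | (column[-1] != column[-n])
  1L + sum(changed)
}

set.seed(42)
many.runs <- make.test.data(1e6, 10000, 26, 6)  # high-cardinality case
few.runs  <- make.test.data(1e6, 100, 10, 4)    # low-cardinality case
count.runs(many.runs)
count.runs(few.runs)
```

Any of the three `row.count` functions can then be wrapped in `system.time()` on each data set to see how the ranking shifts with the number of runs.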
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
> 
> 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> > Of Sam Steingold
> > Sent: Wednesday, October 17, 2012 12:58 PM
> > To: r-help at r-project.org
> > Subject: Re: [R] uniq -c
> >
> > > * Sam Steingold <fqf at tah.bet> [2012-10-16 11:03:27 -0400]:
> > >
> > > I need an analogue of "uniq -c" for a data frame.
> >
> > Summary of options:
> >
> > 1. William:
> >
> > isFirstInRun <- function(x) UseMethod("isFirstInRun")
> > isFirstInRun.default <- function(x) c(TRUE, x[-1] != x[-length(x)])
> > isFirstInRun.data.frame <- function(x) {
> >   stopifnot(ncol(x)>0)
> >   retval <- isFirstInRun(x[[1]])
> >   for(column in x) {
> >     retval <- retval | isFirstInRun(column)
> >   }
> >   retval
> > }
> > row.count.1 <- function (x) {
> >   i <- which(isFirstInRun(x))
> >   data.frame(x[i, , drop=FALSE], count=diff(c(i, 1L+nrow(x))))
> > }
> >
> > 147 seconds
> >
> > 2. http://orgmode.org/worg/org-contrib/babel/examples/Rpackage.html#sec-6-1
> > row.count.2 <- function (x) {
> >   equal.to.previous <- rowSums( x[2:nrow(x),] != x[1:(nrow(x)-1),] )==0
> >   tf.runs <- rle(equal.to.previous)
> >   counts <- c(1, unlist(mapply(function(x,y) if (y) x+1 else (rep(1,x)),
> >                                tf.runs$lengths, tf.runs$values)))
> >   counts <- counts[ c( diff( counts ) <= 0, TRUE ) ]
> >   unique.rows <- which( c(TRUE, !equal.to.previous ) )
> >   cbind(x[ unique.rows, ,drop=FALSE ], counts)
> > }
> >
> > 136 seconds
> >
> > 3. Micael: paste/strsplit
> >
> > row.count.3 <- function (x) {
> >   pa <- do.call(paste,x)   # one space-separated string per row
> >   rl <- rle(pa)            # runs of identical adjacent rows
> >   sp <- strsplit(as.character(rl$values)," ")
> >   data.frame(user = sapply(sp,"[",1),
> >              country = sapply(sp,"[",2),
> >              language = sapply(sp,"[",3),
> >              count = rl$lengths)
> > }
> >
> > Here I know the columns and rely on the absence of spaces in the values.
> >
> > 27 seconds.
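
For a concrete picture of what all three options are meant to compute, here is a small sketch that runs option 1 on a toy data frame. The data frame `visits` and its values are made up for illustration. As with `uniq -c`, only adjacent identical rows are merged, so a row that reappears later starts a new run.

```r
# Option 1 from the summary above (with drop = FALSE added so that
# single-column data frames keep their shape).
isFirstInRun <- function(x) UseMethod("isFirstInRun")
isFirstInRun.default <- function(x) c(TRUE, x[-1] != x[-length(x)])
isFirstInRun.data.frame <- function(x) {
  stopifnot(ncol(x) > 0)
  retval <- isFirstInRun(x[[1]])
  # A row starts a new run if any column changes from the previous row.
  for (column in x) retval <- retval | isFirstInRun(column)
  retval
}
row.count.1 <- function(x) {
  i <- which(isFirstInRun(x))  # indices of the first row of each run
  data.frame(x[i, , drop = FALSE], count = diff(c(i, 1L + nrow(x))))
}

# Hypothetical toy data; the values are invented for this example.
visits <- data.frame(user     = c("ann", "ann", "ann", "bob", "bob", "ann"),
                     country  = c("US",  "US",  "DE",  "DE",  "DE",  "US"),
                     language = c("en",  "en",  "en",  "de",  "de",  "en"),
                     stringsAsFactors = FALSE)

res <- row.count.1(visits)
print(res)
# The counts come out as 2, 1, 2, 1: the last row repeats the first run's
# values but is not adjacent to it, so it is counted separately.
```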
> >
> > Thanks to all who answered.
> >
> > --
> > Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
> > http://www.childpsy.net/ http://www.PetitionOnline.com/tap12009/
> > http://thereligionofpeace.com http://ffii.org http://camera.org
> > A slave dreams not of Freedom, but of owning his own slaves.
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> 



