[R] Tagging identical rows of a matrix

Gabor Grothendieck ggrothendieck at myway.com
Sat May 15 05:34:37 CEST 2004


OK, good point.  I have revised the interaction solution.  It is
unfortunately no longer as short, but with 5 columns it is nearly an
order of magnitude faster than the other two.  It is solution f4,
below:

R> set.seed(1)
R> mat <- matrix(sample(20,100000,replace=TRUE),ncol=5)
R> 
R> f0 <- function(mat) {
+ mat2 <- apply(mat, 1, paste, collapse=":")
+ match(mat2, unique(mat2))
+ }
R> 
R> f1 <- function(mat) { 
+ z <- apply(mat, 1, paste, collapse=":")
+ as.numeric(factor(z,levels=unique(z)))
+ }
R> 
R> f4 <- function(mat) {
+ z <- apply(mat,2,factor)
+ as.numeric(interaction(z %*% ((max(z)+1)^(seq(ncol(z))-1)),drop=TRUE))
+ }
R> 
R> 
R> invisible(gc()); system.time(z0 <- f0(mat))
[1] 2.05 0.00 2.17   NA   NA
R> invisible(gc()); system.time(z1 <- f1(mat))
[1] 2.22 0.01 2.37   NA   NA
R> invisible(gc()); system.time(z4 <- f4(mat))
[1] 0.26 0.00 0.30   NA   NA
R> 
R> all.equal(z0,z1)
[1] TRUE
R> all.equal(z0,z4)
[1] TRUE
R> all.equal(z4,z1)
[1] TRUE
R> 
R> 
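The idea behind f4 is to turn each row into a single number, treating
the column values as digits of a positional code, so that identical
rows get identical keys without pasting strings row by row.  Below is
a minimal sketch of that idea, not the f4 timed above.  The name
tag_rows is hypothetical, and the sketch swaps in a per-column radix
and match() where f4 uses a fixed base and interaction().  It is meant
to avoid relying on apply(mat, 2, factor) returning numeric codes,
which may not hold in every R version.

```r
# Hypothetical helper (not from the thread): tag identical rows of a
# matrix using a mixed-radix numeric key instead of a pasted string.
tag_rows <- function(mat) {
  stopifnot(is.matrix(mat), nrow(mat) > 0)
  key <- numeric(nrow(mat))   # one numeric key per row
  radix <- 1                  # place value of the current column
  for (j in seq_len(ncol(mat))) {
    u <- unique(mat[, j])
    digit <- match(mat[, j], u) - 1   # 0-based digit for this column
    key <- key + digit * radix
    radix <- radix * length(u)
    # Doubles represent integers exactly only up to 2^53; beyond that
    # two different rows could collide on the same key.
    stopifnot(radix <= 2^53)
  }
  # Number the groups in order of first appearance, as f0 does.
  match(key, unique(key))
}

m <- rbind(c(1, 2), c(3, 4), c(1, 2), c(1, 2), c(3, 4), c(5, 6))
print(tag_rows(m))   # [1] 1 2 1 1 2 3
```

The key is only unique while the product of the per-column counts of
distinct values fits in a double without rounding, hence the 2^53
check.  For 5 columns of 20 values that product is 3.2 million, which
is far below the limit, but the limit would be reached with enough
columns or enough distinct values per column.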



Liaw, Andy <andy_liaw <at> merck.com> writes:

: 
: The problem with interaction() is that it doesn't scale with increasing
: number of columns:
: 
: > set.seed(1)
: > mat2 <- matrix(sample(20,5e4,rep=T), 1e4)
: > invisible(gc()); system.time(z0 <- f0(mat2))
: [1] 1.58 0.01 1.85   NA   NA
: > invisible(gc()); system.time(z1 <- f1(mat2))
: [1] 1.57 0.00 1.66   NA   NA
: > invisible(gc()); system.time(z2 <- f2g(mat2))
: [1] 34.14  0.60 57.45    NA    NA
: 
: [f2g is the slightly modified version of f2 to allow for any number of
: columns:
: f2g <- function(mat) as.numeric(interaction(as.data.frame(mat), drop=T))]
: 
: With 10 columns in the matrix, f0 and f1 ran fine in under 10 seconds, but
: f2g started thrashing, and ran out of memory after a while.  If you look at
: how interaction() is written you'll quickly see why...
: 
: Andy
: 
: > From: Gabor Grothendieck
: > 
: > Waichler, Scott R <Scott.Waichler <at> pnl.gov> writes:
: > 
: > > 
: > > Thanks to all of you who responded to my help request.
: > > Here is the very efficient upshot of your advice:
: > > 
: > > > mat2 <- apply(mat, 1, paste, collapse=":")
: > > > vec <- match(mat2, unique(mat2))
: > > > vec
: > > [1] 1 2 1 1 2 3
: > > 
: > > 
: > > P.S.  I found that Andy Liaw's method didn't preserve the
: > > index order that I wanted; it yields
: > > 
: > > 2 3 2 2 3 1
: > > 
: > > To get the order of integers I was looking for required an
: > > invocation of unique:
: > > 
: > > as.numeric(factor(apply(mat, 1, paste, collapse=":"),
: > >                   levels=unique(apply(mat, 1, paste, collapse=":"))))
: > > 
: > > But the first method above is obviously cleaner and is twice
: > > as fast, only 9 seconds for a 100000 row matrix on an ordinary PC.  
: > 
: > The interaction solution gives an identical result, is shorter and
: > is one or two orders of magnitude faster.  Here is a 
: > comparison of the three:
: > 
: > R> set.seed(1)
: > R> mat <- matrix(sample(20,100000,rep=T),50000)
: > R> 
: > R> f0 <- function(mat) {
: > + mat2 <- apply(mat, 1, paste, collapse=":");
: > + match(mat2, unique(mat2))
: > + }
: > R> 
: > R> 
: > R> f1 <- function(mat) { z <- apply(mat, 1, paste, collapse=":")
: > + as.numeric(factor(z,levels=unique(z)))
: > + }
: > R> 
: > R> f2 <- function(mat) as.numeric(interaction(mat[,1],mat[,2],drop=T))
: > R> 
: > R> dummy <- gc(); system.time(z0 <- f0(mat))
: > [1] 5.24 0.02 5.52   NA   NA
: > R> dummy <- gc(); system.time(z1 <- f1(mat))
: > [1] 5.18 0.00 5.52   NA   NA
: > R> dummy <- gc(); system.time(z2 <- f2(mat))
: > [1] 0.1 0.0 0.1  NA  NA
: > R> all.equal(z0,z1)
: > [1] TRUE
: > R> all.equal(z0,z2)
: > [1] TRUE
: > R> all.equal(z2,z1)
: > [1] TRUE
: > 
: > ______________________________________________
: > R-help <at> stat.math.ethz.ch mailing list
: > https://www.stat.math.ethz.ch/mailman/listinfo/r-help
: > PLEASE do read the posting guide! 
: > http://www.R-project.org/posting-guide.html
: > 
: >
: 
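On the scaling point quoted above, here is a rough sketch of why
crossing many columns with interaction() can get expensive.  By
default the crossed factor carries one level for every combination of
the input levels, whether or not that combination occurs in the data,
so the number of potential levels is the product of the per-column
level counts.  The helper name potential_levels is hypothetical, and
the sketch does not reproduce the interaction() source of the time;
newer R versions may handle drop=TRUE differently.

```r
# Two small factors: three observed combinations out of four possible.
a <- factor(c("x", "y", "x"))
b <- factor(c("p", "p", "q"))

nlevels(interaction(a, b))               # 4: every combination is a level
nlevels(interaction(a, b, drop = TRUE))  # 3: only combinations that occur

# Hypothetical helper: upper bound on the number of crossed levels for
# a matrix whose columns are treated as factors.
potential_levels <- function(mat) {
  prod(apply(mat, 2, function(col) length(unique(col))))
}

set.seed(1)
mat5 <- matrix(sample(20, 5e4, replace = TRUE), ncol = 5)
potential_levels(mat5)   # at most 20^5 = 3.2e6 for 10000 rows
```

With 10 columns of 20 values each the same bound is 20^10, about 1e13,
which is consistent with the thrashing reported above for f2g.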