[R] Tagging identical rows of a matrix
Gabor Grothendieck
ggrothendieck at myway.com
Sat May 15 05:34:37 CEST 2004
OK, good point. I have revised the interaction solution. It is
unfortunately no longer as short, but with 5 columns it is nearly an
order of magnitude faster than the other two. It is solution f4,
below:
R> set.seed(1)
R> mat <- matrix(sample(20,100000,rep=T),nc=5)
R>
R> f0 <- function(mat) {
+ mat2 <- apply(mat, 1, paste, collapse=":")
+ match(mat2, unique(mat2))
+ }
R>
R> f1 <- function(mat) {
+ z <- apply(mat, 1, paste, collapse=":")
+ as.numeric(factor(z,levels=unique(z)))
+ }
R>
R> f4 <- function(mat) {
+ z <- apply(mat,2,factor)
+ as.numeric(interaction(z %*% ((max(z)+1)^(seq(ncol(z))-1)),drop=T))
+ }
R>
R>
R> invisible(gc()); system.time(z0 <- f0(mat))
[1] 2.05 0.00 2.17 NA NA
R> invisible(gc()); system.time(z1 <- f1(mat))
[1] 2.22 0.01 2.37 NA NA
R> invisible(gc()); system.time(z4 <- f4(mat))
[1] 0.26 0.00 0.30 NA NA
R>
R> all.equal(z0,z1)
[1] TRUE
R> all.equal(z0,z4)
[1] TRUE
R> all.equal(z4,z1)
[1] TRUE
R>
R>
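
The idea behind f4 is to turn each row into a single number by treating
the per-column codes as digits of a mixed-radix number, so that identical
rows get identical keys. Below is a minimal sketch of that idea, not the
code timed above. The name tag_rows and the example matrix m are made up
for illustration. It uses match() on each column rather than
apply(mat, 2, factor), since in current versions of R I believe the latter
gives a character matrix that %*% would reject. The keys are doubles, so
this assumes the product of the per-column distinct counts stays below 2^53.

```r
# Hypothetical helper (not from the thread): tag identical rows of a
# matrix using a mixed-radix key built column by column.
tag_rows <- function(mat) {
  key   <- numeric(nrow(mat))  # running key, one entry per row
  place <- 1                   # place value of the current column
  for (j in seq_len(ncol(mat))) {
    # Integer codes 1..k for column j, where k = number of distinct values
    code  <- match(mat[, j], unique(mat[, j]))
    key   <- key + (code - 1) * place
    place <- place * max(code)
  }
  # Number the groups in order of first appearance
  match(key, unique(key))
}

# Made-up example matrix with three distinct rows
m <- rbind(c(1, 2), c(3, 4), c(1, 2), c(1, 2), c(3, 4), c(5, 6))
tag_rows(m)   # 1 2 1 1 2 3
```

The final match(key, unique(key)) numbers the groups by first appearance,
which is the same convention f0 uses.
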
Liaw, Andy <andy_liaw <at> merck.com> writes:
:
: The problem with interaction() is that it doesn't scale with increasing
: number of columns:
:
: > set.seed(1)
: > mat2 <- matrix(sample(20,5e4,rep=T), 1e4)
: > invisible(gc()); system.time(z0 <- f0(mat2))
: [1] 1.58 0.01 1.85 NA NA
: > invisible(gc()); system.time(z1 <- f1(mat2))
: [1] 1.57 0.00 1.66 NA NA
: > invisible(gc()); system.time(z2 <- f2g(mat2))
: [1] 34.14 0.60 57.45 NA NA
:
: [f2g is the slightly modified version of f2 to allow for any number of
: columns:
: f2g <- function(mat) as.numeric(interaction(as.data.frame(mat), drop=T))]
:
: With 10 columns in the matrix, f0 and f1 ran fine in under 10 seconds, but
: f2g started thrashing, and ran out of memory after a while. If you look at
: how interaction() is written you'll quickly see why...
:
: Andy
:
: > From: Gabor Grothendieck
: >
: > Waichler, Scott R <Scott.Waichler <at> pnl.gov> writes:
: >
: > >
: > > Thanks to all of you who responded to my help request.
: > > Here is the very efficient upshot of your advice:
: > >
: > > > mat2 <- apply(mat, 1, paste, collapse=":")
: > > > vec <- match(mat2, unique(mat2))
: > > > vec
: > > [1] 1 2 1 1 2 3
: > >
: > >
: > > P.S. I found that Andy Liaw's method didn't preserve the
: > > index order that I wanted; it yields
: > >
: > > 2 3 2 2 3 1
: > >
: > > To get the order of integers I was looking for required an
: > > invocation of unique:
: > >
: > > as.numeric(factor(apply(mat, 1, paste, collapse=":"),
: > >                   levels=unique(apply(mat, 1, paste, collapse=":"))))
: > >
: > > But the first method above is obviously cleaner and is twice
: > > as fast, only 9 seconds for a 100000 row matrix on an ordinary PC.
: >
: > The interaction solution gives an identical result, is shorter and
: > is one or two orders of magnitude faster. Here is a
: > comparison of the three:
: >
: > R> set.seed(1)
: > R> mat <- matrix(sample(20,100000,rep=T),50000)
: > R>
: > R> f0 <- function(mat) {
: > + mat2 <- apply(mat, 1, paste, collapse=":");
: > + match(mat2, unique(mat2))
: > + }
: > R>
: > R>
: > R> f1 <- function(mat) { z <- apply(mat, 1, paste, collapse=":")
: > + as.numeric(factor(z,levels=unique(z)))
: > + }
: > R>
: > R> f2 <- function(mat) as.numeric(interaction(mat[,1],mat[,2],drop=T))
: > R>
: > R> dummy <- gc(); system.time(z0 <- f0(mat))
: > [1] 5.24 0.02 5.52 NA NA
: > R> dummy <- gc(); system.time(z1 <- f1(mat))
: > [1] 5.18 0.00 5.52 NA NA
: > R> dummy <- gc(); system.time(z2 <- f2(mat))
: > [1] 0.1 0.0 0.1 NA NA
: > R> all.equal(z0,z1)
: > [1] TRUE
: > R> all.equal(z0,z2)
: > [1] TRUE
: > R> all.equal(z2,z1)
: > [1] TRUE
: >
: > ______________________________________________
: > R-help <at> stat.math.ethz.ch mailing list
: > https://www.stat.math.ethz.ch/mailman/listinfo/r-help
: > PLEASE do read the posting guide!
: > http://www.R-project.org/posting-guide.html
: >
: >
:
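
On the scaling concern quoted above: without drop, interaction() returns a
factor whose levels are every combination of the input levels, so the number
of potential levels is the product of the per-factor level counts. The toy
example below only illustrates that count. The factors a and b are made up,
and nothing here is a claim about how a particular version of R implements
interaction() internally.

```r
# Toy factors: three levels each, but only three observed combinations
a <- factor(c("x", "y", "z"))
b <- factor(c("p", "q", "r"))

full    <- interaction(a, b)               # keeps every level combination
dropped <- interaction(a, b, drop = TRUE)  # keeps only observed combinations

nlevels(full)     # 9, i.e. 3 * 3
nlevels(dropped)  # 3

# Potential level counts with 20 distinct values per column
20^5    # 3200000
20^10   # 1.024e+13
```

With 5 columns of 20 values the potential level count is already in the
millions, and with 10 columns it is far beyond what can be held in memory,
which is consistent with the thrashing Andy reports.
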