[R] Tagging identical rows of a matrix
Liaw, Andy
andy_liaw at merck.com
Sat May 15 03:20:21 CEST 2004
The problem with interaction() is that it doesn't scale as the number of
columns increases. With 5 columns and 10000 rows:
> set.seed(1)
> mat2 <- matrix(sample(20,5e4,rep=T), 1e4)
> invisible(gc()); system.time(z0 <- f0(mat2))
[1] 1.58 0.01 1.85 NA NA
> invisible(gc()); system.time(z1 <- f1(mat2))
[1] 1.57 0.00 1.66 NA NA
> invisible(gc()); system.time(z2 <- f2g(mat2))
[1] 34.14 0.60 57.45 NA NA
[f2g is a slightly modified version of f2 that allows any number of
columns:
f2g <- function(mat) as.numeric(interaction(as.data.frame(mat), drop=TRUE))]
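For comparison, the paste-based approach can also be written so that it handles any number of columns without calling apply() row by row. The following is only a sketch: f0v is a hypothetical name that does not appear in the thread, and no timings were measured for it.

```r
# f0 as defined in the quoted message below (row-wise paste via apply)
f0 <- function(mat) {
  keys <- apply(mat, 1, paste, collapse = ":")
  match(keys, unique(keys))
}

# Hypothetical variant (not from the thread): build the row keys with a
# single vectorised paste() call across the columns, then tag as in f0.
f0v <- function(mat) {
  keys <- do.call(paste, c(as.data.frame(mat), sep = ":"))
  match(keys, unique(keys))
}

set.seed(1)
mat2 <- matrix(sample(20, 5e4, replace = TRUE), 1e4)

# Both functions number the distinct rows in order of first appearance
same <- identical(f0v(mat2), f0(mat2))
print(same)
```

Because the keys are plain strings, memory use grows with the number of rows and columns actually present, not with the number of possible combinations of values.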
With 10 columns in the matrix, f0 and f1 ran fine in under 10 seconds, but
f2g started thrashing, and ran out of memory after a while. If you look at
how interaction() is written you'll quickly see why...
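The blow-up can be made concrete with a little arithmetic. This sketch assumes (the message does not spell it out) that interaction() enumerates every possible combination of levels before the unused ones are dropped. The variable names are illustrative only.

```r
# Each column holds values drawn from 1:20, so it has at most 20 levels
n_values <- 20
n_rows   <- 1e4
n_cols   <- c(2, 5, 10)

# Number of level combinations a full cross product would have to label
potential <- n_values^n_cols

# Number of distinct rows that can actually occur is capped by the row count
max_observed <- pmin(potential, n_rows)

print(data.frame(columns = n_cols,
                 potential_levels = potential,
                 max_observed = max_observed))
```

With 10 columns the full cross product has about 1e13 combinations, while a 10000-row matrix can contain at most 1e4 distinct rows.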
Andy
> From: Gabor Grothendieck
>
> Waichler, Scott R <Scott.Waichler <at> pnl.gov> writes:
>
> >
> > Thanks to all of you who responded to my help request.
> > Here is the very efficient upshot of your advice:
> >
> > > mat2 <- apply(mat, 1, paste, collapse=":")
> > > vec <- match(mat2, unique(mat2))
> > > vec
> > [1] 1 2 1 1 2 3
> >
> >
> > P.S. I found that Andy Liaw's method didn't preserve the
> > index order that I wanted; it yields
> >
> > 2 3 2 2 3 1
> >
> > To get the order of integers I was looking for required an
> > invocation of unique:
> >
> > as.numeric(factor(apply(mat, 1, paste, collapse=":"),
> >                   levels=unique(apply(mat, 1, paste, collapse=":"))))
> >
> > But the first method above is obviously cleaner and is twice
> > as fast, only 9 seconds for a 100000 row matrix on an ordinary PC.
>
> The interaction solution gives an identical result, is shorter and
> is one or two orders of magnitude faster. Here is a
> comparison of the three:
>
> R> set.seed(1)
> R> mat <- matrix(sample(20,100000,rep=T),50000)
> R>
> R> f0 <- function(mat) {
> + mat2 <- apply(mat, 1, paste, collapse=":");
> + match(mat2, unique(mat2))
> + }
> R>
> R>
> R> f1 <- function(mat) { z <- apply(mat, 1, paste, collapse=":")
> + as.numeric(factor(z,levels=unique(z)))
> + }
> R>
> R> f2 <- function(mat) as.numeric(interaction(mat[,1],mat[,2],drop=T))
> R>
> R> dummy <- gc(); system.time(z0 <- f0(mat))
> [1] 5.24 0.02 5.52 NA NA
> R> dummy <- gc(); system.time(z1 <- f1(mat))
> [1] 5.18 0.00 5.52 NA NA
> R> dummy <- gc(); system.time(z2 <- f2(mat))
> [1] 0.1 0.0 0.1 NA NA
> R> all.equal(z0,z1)
> [1] TRUE
> R> all.equal(z0,z2)
> [1] TRUE
> R> all.equal(z2,z1)
> [1] TRUE