[Rd] extracting rows from a data frame by looping over the row names: performance issues
Greg Snow
Greg.Snow at intermountainmail.org
Mon Mar 5 17:07:56 CET 2007
The difference is in indexing by row number vs. indexing by row name.
It has long been known that names slow matrices down. Some routines
make a copy of the dimnames of a matrix, remove the dimnames, do the
computations with the matrix, then put the dimnames back on. This can
speed things up quite a bit in some circumstances. For your example,
indexing by number means jumping to a specific offset in the matrix,
while indexing by name means searching through all the names and doing
string comparisons to find the match. A 300-fold difference in speed is
not surprising.
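
As a rough sketch of both points (the data, sizes, and object names
below are made up for illustration and are not from the original
example): row names can be resolved to positions once with match(), and
dimnames can be set aside during a computation and restored afterwards.
Whether either step pays off will depend on the R version and the size
of the data.

```r
# Illustrative data frame with character row names (hypothetical sizes).
n <- 1000
dat <- data.frame(x = seq_len(n), y = rev(seq_len(n)),
                  row.names = paste0("row", seq_len(n)))
keys <- c("row10", "row500", "row999")

# Resolve all names to row positions in a single pass.
idx <- match(keys, row.names(dat))
stopifnot(!anyNA(idx))  # stop early if a key is not a row name

# Extract the same rows by name and by position for comparison.
by_name <- lapply(keys, function(k) dat[k, ])
by_pos  <- lapply(idx,  function(i) dat[i, ])

# Set dimnames aside, compute on the bare matrix, then restore them.
m <- matrix(as.numeric(1:6), nrow = 2,
            dimnames = list(c("a", "b"), c("x", "y", "z")))
dn <- dimnames(m)
dimnames(m) <- NULL
m2 <- m * 2             # stand-in for a heavier computation
dimnames(m2) <- dn
```

With idx in hand, a loop can use dat[i, ] instead of dat[key, ].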
--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111
> -----Original Message-----
> From: Herve Pages [mailto:hpages at fhcrc.org]
> Sent: Friday, March 02, 2007 7:04 PM
> To: Greg Snow
> Cc: r-devel at r-project.org
> Subject: Re: [Rd] extracting rows from a data frame by
> looping over the row names: performance issues
>
> Hi Greg,
>
> Greg Snow wrote:
> > Your 2 examples have 2 differences and they are therefore confounded
> > in their effects.
> >
> > What are your results for:
> >
> > system.time(for (i in 1:100) {row <- dat[i, ] })
> >
> >
> >
>
> Right. What you suggest is even faster (and simpler):
>
> > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
> > dat <- as.data.frame(mat)
>
> > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
>    user  system elapsed
>  13.241   0.460  13.702
>
> > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
>    user  system elapsed
>   0.280   0.372   0.650
>
> > system.time(for (i in 1:100) {row <- dat[i, ] })
>    user  system elapsed
>   0.044   0.088   0.130
>
> So apparently here extracting with dat[i, ] is about 300 times
> faster in user time (roughly 100 times in elapsed time) than
> extracting with dat[key, ]!
>
> > system.time(for (i in 1:100) dat["1", ])
>    user  system elapsed
>  12.680   0.396  13.075
>
> > system.time(for (i in 1:100) dat[1, ])
>    user  system elapsed
>   0.060   0.076   0.137
>
> Good to know!
>
> Thanks a lot,
> H.
>
>
>