[R] data frame vs. matrix
William Dunlap
wdunlap at tibco.com
Mon Mar 17 00:36:59 CET 2014
Duncan's analysis suggests another way to do this:
extract the 'x' vector, operate on that vector in a loop,
then insert the result into the data.frame. I added
a df="quicker" option to your df argument and made the test
dataset deterministic so we could verify that the algorithms
do the same thing:
dumkoll <- function(n = 1000, df = TRUE){
    dfr <- data.frame(x = log(seq_len(n)), y = sqrt(seq_len(n)))
    if (identical(df, "quicker")) {
        x <- dfr$x
        for(i in 2:length(x)) {
            x[i] <- x[i-1]
        }
        dfr$x <- x
    } else if (df){
        for (i in 2:NROW(dfr)){
            # if (!(i %% 100)) cat("i = ", i, "\n")
            dfr$x[i] <- dfr$x[i-1]
        }
    }else{
        dm <- as.matrix(dfr)
        for (i in 2:NROW(dm)){
            # if (!(i %% 100)) cat("i = ", i, "\n")
            dm[i, 1] <- dm[i-1, 1]
        }
        dfr$x <- dm[, 1]
    }
    dfr
}
Timings for n = 10^4, 2*10^4, and 4*10^4 show that the time is roughly
quadratic in n for the df=TRUE case and close to linear in the other cases,
with the new method taking about 60% of the time of the matrix method:
> n <- c("10k"=1e4, "20k"=2e4, "40k"=4e4)
> sapply(n, function(n)system.time(dumkoll(n, df=FALSE))[1:3])
           10k  20k  40k
user.self 0.11 0.22 0.43
sys.self  0.02 0.00 0.00
elapsed   0.12 0.22 0.44
> sapply(n, function(n)system.time(dumkoll(n, df=TRUE))[1:3])
           10k   20k   40k
user.self 3.59 14.74 78.37
sys.self  0.00  0.11  0.16
elapsed   3.59 14.91 78.81
> sapply(n, function(n)system.time(dumkoll(n, df="quicker"))[1:3])
           10k  20k  40k
user.self 0.06 0.12 0.26
sys.self  0.00 0.00 0.00
elapsed   0.07 0.13 0.27
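
As a rough check of these growth rates, the exponent k in time ~ n^k can be
estimated from each doubling of n as log2 of the ratio of successive timings.
The sketch below uses only the user.self figures from the tables above; the
variable names are made up for illustration.

```r
# user.self timings (seconds) copied from the tables above, for n = 10k, 20k, 40k
user_time <- rbind(
    matrix  = c(0.11, 0.22, 0.43),    # df = FALSE
    df      = c(3.59, 14.74, 78.37),  # df = TRUE
    quicker = c(0.06, 0.12, 0.26)     # df = "quicker"
)
colnames(user_time) <- c("10k", "20k", "40k")

# n doubles between columns, so log2(t[2n] / t[n]) estimates k in t ~ n^k
exponent <- log2(user_time[, -1] / user_time[, -ncol(user_time)])
colnames(exponent) <- c("10k->20k", "20k->40k")
print(round(exponent, 2))

# share of the matrix method's time used by the vector method
print(round(user_time["quicker", ] / user_time["matrix", ], 2))
```

An exponent near 1 indicates linear growth and one near 2 indicates quadratic
growth. With timings this small, the estimates for the two fast methods are
noisy.
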
I also timed the two faster cases for n=10^6. The time still looks linear
in n, with the vector approach still taking about 60% of the time of the
matrix approach.
> system.time(dumkoll(n=10^6, df=FALSE))
   user  system elapsed
  11.65    0.12   11.82
> system.time(dumkoll(n=10^6, df="quicker"))
   user  system elapsed
   6.79    0.08    6.91
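
For comparison, a naive quadratic extrapolation of the df=TRUE timing from
n=4*10^4 to n=10^6 lands in the same range as the roughly 12 hours reported
in the original message quoted below. This is only a back-of-the-envelope
sketch: it assumes the growth stays exactly quadratic, and the two sets of
timings come from different machines.

```r
t_40k <- 78.37           # user.self seconds for df = TRUE at n = 4e4 (table above)
scale <- (1e6 / 4e4)^2   # quadratic scaling factor from n = 4e4 to n = 1e6
predicted <- t_40k * scale

cat("predicted seconds:", predicted, "\n")
cat("predicted hours:  ", round(predicted / 3600, 1), "\n")
```
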
The results from each method are identical:
> identical(dumkoll(100,df=FALSE), dumkoll(100,df=TRUE))
[1] TRUE
> identical(dumkoll(100,df=FALSE), dumkoll(100,df="quicker"))
[1] TRUE
If your data.frame has columns of various types, then as.matrix will
coerce them all to a common type (often character), so it may give
you the wrong result in addition to being unnecessarily slow.
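
A minimal sketch of that coercion, using a made-up two-column data frame
(the column names and values are hypothetical), together with the vector
approach applied to the same data:

```r
# Hypothetical mixed-type data frame: one integer and one character column
dfr <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

m <- as.matrix(dfr)
print(typeof(m))              # the whole matrix has a single storage type
print(is.numeric(m[, "id"]))  # so the integer column is no longer numeric

# Vector approach: only the column being modified is extracted and replaced
id <- dfr$id
for (i in seq_along(id)[-1]) {
    id[i] <- id[i - 1]
}
dfr$id <- id
print(sapply(dfr, class))     # column types are unchanged
```

Because only one column is pulled out and put back, the other columns never
pass through a common type.
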
Bill Dunlap
TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of Duncan Murdoch
> Sent: Sunday, March 16, 2014 3:56 PM
> To: Göran Broström; r-help at r-project.org
> Subject: Re: [R] data frame vs. matrix
>
> On 14-03-16 2:57 PM, Göran Broström wrote:
> > I have always known that "matrices are faster than data frames", for
> > instance this function:
> >
> >
> > dumkoll <- function(n = 1000, df = TRUE){
> >     dfr <- data.frame(x = rnorm(n), y = rnorm(n))
> >     if (df){
> >         for (i in 2:NROW(dfr)){
> >             if (!(i %% 100)) cat("i = ", i, "\n")
> >             dfr$x[i] <- dfr$x[i-1]
> >         }
> >     }else{
> >         dm <- as.matrix(dfr)
> >         for (i in 2:NROW(dm)){
> >             if (!(i %% 100)) cat("i = ", i, "\n")
> >             dm[i, 1] <- dm[i-1, 1]
> >         }
> >         dfr$x <- dm[, 1]
> >     }
> > }
> >
> > --------------------
> > > system.time(dumkoll())
> >
> >     user  system elapsed
> >    0.046   0.000   0.045
> >
> > > system.time(dumkoll(df = FALSE))
> >
> >     user  system elapsed
> >    0.007   0.000   0.008
> > ----------------------
> >
> > OK, no big deal, but I stumbled over a data frame with one million
> > records. Then, with df = TRUE,
> > ----------------------------
> >        user    system   elapsed
> >   44677.141  1271.544 46016.754
> > ----------------------------
> > This is around 12 hours.
> >
> > With df = FALSE, it took only six seconds! About 7500 times faster.
> >
> > I was really surprised by the huge difference, and I wonder if this is
> > to be expected, or if it is some peculiarity with my installation: I'm
> > running Ubuntu 13.10 on a MacBook Pro with 8 GB of memory, R-3.0.3.
>
> I don't find it surprising. The line
>
> dfr$x[i] <- dfr$x[i-1]
>
> will be executed about a million times. It does the following:
>
> 1. Get a pointer to the x element of dfr. This requires R to look
>    through all the names of dfr to figure out which one is "x".
>
> 2. Extract the i-1 element from it. Not particularly slow.
>
> 3. Get a pointer to the x element of dfr again. (R doesn't cache these
>    things.)
>
> 4. Set the i element of it to a new value. This could require the
>    entire column or even the entire dataframe to be copied, if R hasn't
>    kept track of the fact that it is really being changed in place. In a
>    complex assignment like that, I wouldn't be surprised if that took
>    place. (In the matrix equivalent, it would be easier to recognize that
>    it is safe to change the existing value.)
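
The four steps above can be made visible by writing the nested replacement
out as explicit function calls, following the expansion of complex
assignments given in the R Language Definition. This is a sketch of the
semantics only, not of R's internal C code, and the three-row data frame is
made up for illustration.

```r
dfr <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
i <- 3

# The usual form
a <- dfr
a$x[i] <- a$x[i - 1]

# The same assignment written as explicit calls:
#   b$x           looks up column "x"                (steps 1 and 3)
#   b$x[i - 1]    extracts the previous element      (step 2)
#   `[<-`(...)    builds the modified column         (step 4)
#   `$<-`(...)    stores that column back into b     (step 4)
b <- dfr
b <- `$<-`(b, "x", `[<-`(b$x, i, b$x[i - 1]))

print(identical(a, b))
```

Each pass through the loop therefore goes through the data-frame method for
`$<-`, whereas the vector version only calls `[<-` on a plain numeric vector.
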
>
> Luke Tierney is making some changes in R-devel that might help a lot in
> cases like this, but I expect the matrix code will always be faster.
>
> Duncan Murdoch