[R] Speed Advice for R --- avoid data frames

Uwe Ligges ligges at statistik.tu-dortmund.de
Sun Jul 3 18:19:22 CEST 2011

On 02.07.2011 21:35, ivo welch wrote:
> hi uwe---thanks for the clarification.  of course, my example should always
> be done in vectorized form.  I only used it to show how iterative access
> compares in the simplest possible fashion.  <100 accesses per second is
> REALLY slow, though.
> I don't know R internals and the learning curve would be steep.  moreover,
> there is no guarantee that changes I would make would be accepted.  so, I
> cannot do this.
> however, for an R expert, this should not be too difficult.  conceptually,
> if data frame element access primitives are create/write/read/destroy in the
> code, then it's truly trivial.  just add a matrix (dim the same as the data
> frame) of byte pointers to point at the storage upon creation/change time.
> this would be quick-and-dirty.  out of curiosity, do you know which source
> file has the data frame internals?  maybe I will get tempted anyway if it is
> simple enough.

I think you should start by looking at the mechanisms used to construct
data.frames (such as data.frame()) and learn that data.frames are special
lists. Then you may want to look at the differences between the
.Primitive("[") and .Primitive("[<-") used for vectors (including
vectors with dim attributes such as matrices) and the corresponding
methods for data.frames: "[<-.data.frame" and "[.data.frame".
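
A minimal sketch of that difference, using only base R and utils; the
object names D and M are chosen here for illustration only:

```r
# A data.frame is a list of equal-length columns with a class attribute;
# a matrix is a single atomic vector with a dim attribute.
D <- as.data.frame(matrix(rnorm(6), nrow = 3))
M <- as.matrix(D)

is.list(D)   # each column is one list element
is.list(M)   # a numeric matrix is not a list

# For plain vectors and matrices, `[` is the internal primitive ...
is.primitive(`[`)

# ... while for data.frames, `[` and `[<-` dispatch to S3 methods that are
# ordinary R closures, so every single-element access runs interpreted R code.
df_extract <- getS3method("[", "data.frame")
df_assign  <- getS3method("[<-", "data.frame")
is.primitive(df_extract)
is.primitive(df_assign)
```

Printing df_assign at the prompt shows how much checking is done for each
call, which is where the per-access overhead in a loop comes from.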

After that, I doubt you will want to pursue this further. Note also that
data.frames can be pretty large, and you really do not want to store a
matrix of pointers as large as the data.frame. People working with
large data.frames won't be happy with such a suggestion.
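
As a rough back-of-the-envelope sketch of the memory cost: the 8 bytes per
pointer below is an assumption (a 64-bit build), and the variable names are
made up for this illustration:

```r
n_row <- 1000
n_col <- 1000
D <- as.data.frame(matrix(rnorm(n_row * n_col), nrow = n_row))

pointer_bytes <- 8   # assumption: one 8-byte pointer per cell (64-bit build)

# Size of a hypothetical pointer matrix with the same dimensions as D
index_bytes <- n_row * n_col * pointer_bytes

# Size of the data.frame itself, as estimated by utils::object.size()
data_bytes <- as.numeric(object.size(D))

# For all-numeric columns the pointer matrix is about as large as the data.
index_bytes / data_bytes
```

Under that assumption the index would be about the same size as an
all-numeric data.frame, and larger than the data for integer or logical
columns, which use 4 bytes per element.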

If you want to follow up, I'd suggest moving the thread to R-devel,
where it seems to be more appropriate.


> (a more efficient but more involved way to do this would be to store a data
> frame internally always as a matrix of data pointers, but this would
> probably require more surgery.)
> It is also not as important for me as it is for others...to give a good
> impression to those who are not aware of the tradeoffs---which is most
> people considering adopting R.
> /iaw
> ----
> Ivo Welch (ivo.welch at gmail.com)
> 2011/7/2 Uwe Ligges<ligges at statistik.tu-dortmund.de>
>> Some comments:
>> The comparison of matrix rows vs. matrix columns is incorrect: R uses
>> lazy evaluation, so the matrix is constructed during the timing for the
>> rows and already exists during the timing for the columns. Hence you want
>> to use:
>>   M<- matrix( rnorm(C*R), nrow=R )
>>   D<- as.data.frame(matrix( rnorm(C*R), nrow=R ) )
>>   example(M)
>>   example(D)
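
The lazy-evaluation point can be checked with a small sketch. The flag
`evaluated` and the function `probe` are hypothetical names used only to
observe when the argument is actually computed:

```r
# Hypothetical flag, used only to observe when the argument is evaluated.
evaluated <- FALSE

probe <- function(m) {
  before <- evaluated   # the argument has not been touched yet
  force(m)              # first use: the argument expression is evaluated now
  after <- evaluated
  c(before = before, after = after)
}

# The argument expression sets the flag as a side effect of being evaluated.
res <- probe({ evaluated <- TRUE; matrix(rnorm(4), nrow = 2) })
res
```

The argument is only computed at its first use inside the function, so in
the original example the cost of building the matrix falls inside the first
system.time() call. Building M and D beforehand keeps it out of both timings.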
>> Furthermore, you are correct with your statement that data.frame indexing
>> is much slower, but if you can store your data in matrix form, just carry
>> on with that.
>> I doubt anybody is really going to perform the index operation you cited
>> within a loop. With a data.frame, I can live with the speed of vectorized
>> replacements:
>>> system.time(D[,20]<- sqrt(abs(D[,20])) + rnorm(1000))
>>    user  system elapsed
>>    0.01    0.00    0.01
>>> system.time(D[20,]<- sqrt(abs(D[20,])) + rnorm(1000))
>>    user  system elapsed
>>    0.51    0.00    0.52
>> OK, it would be nice to do that faster, but this is not easy. I think R
>> Core is happy to see contributions to make it faster without breaking
>> existing features.
>> Best wishes,
>> Uwe
>> On 02.07.2011 20:35, ivo welch wrote:
>>> This email is intended for R users who are not that familiar with R
>>> internals and are searching Google for ways to speed up R.
>>> Despite a common misperception, R is not slow when it comes to iterative
>>> access.  R is fast when it comes to matrices.  R is very slow when it
>>> comes to iterative access into data frames.  Such access occurs when a
>>> user uses "data$varname[index]", which is a very common operation.  To
>>> illustrate, run the following program:
>>> R <- 1000; C <- 1000
>>> example <- function(m) {
>>>    cat("rows: ")
>>>    cat(system.time(for (r in 1:R) m[r,20] <- sqrt(abs(m[r,20])) + rnorm(1)), "\n")
>>>    cat("columns: ")
>>>    cat(system.time(for (c in 1:C) m[20,c] <- sqrt(abs(m[20,c])) + rnorm(1)), "\n")
>>>    if (is.data.frame(m)) {
>>>       cat("df: columns as names: ")
>>>       cat(system.time(for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1)), "\n")
>>>    }
>>> }
>>> cat("\n**** Now as matrix\n")
>>> example( matrix( rnorm(C*R), nrow=R ) )
>>> cat("\n**** Now as data frame\n")
>>> example( as.data.frame( matrix( rnorm(C*R), nrow=R ) ) )
>>> The following are the reported timings under R 2.12.0 on a Mac Pro 3,1
>>> with ample RAM:
>>> matrix, columns: 0.01s
>>> matrix, rows: 0.175s
>>> data frame, columns: 53s
>>> data frame, rows: 56s
>>> data frame, names: 58s
>>> Data frame access is about 5,000 times slower than matrix column
>>> access, and 300 times slower than matrix row access.  R's data frame
>>> operational speed is an amazing 40 data accesses per second.  I have
>>> not seen access numbers this low for decades.
>>> How to avoid it?  Not easy.  One way is to create multiple matrices,
>>> and group them as an object.  Of course, this loses a lot of features
>>> of R.  Another way is to copy all data used in calculations out of the
>>> data frame into a matrix, do the operations, and then copy them back.
>>> Not ideal, either.
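
A minimal sketch of the copy-out/copy-back idea for a single column. The
function names `slow` and `fast` are made up for this illustration, and a
constant is added instead of rnorm(1) so that both versions can be compared:

```r
set.seed(1)
n <- 1000
D <- as.data.frame(matrix(rnorm(n * 30), nrow = n))

# Pattern from the example above: every iteration goes through the
# data.frame indexing methods.
slow <- function(d) {
  for (r in seq_len(nrow(d))) d[r, 20] <- sqrt(abs(d[r, 20])) + 1
  d
}

# Workaround: extract the column once, loop over the plain vector,
# and write the result back with a single assignment.
fast <- function(d) {
  x <- d[[20]]
  for (r in seq_along(x)) x[r] <- sqrt(abs(x[r])) + 1
  d[[20]] <- x
  d
}

system.time(res_slow <- slow(D))
system.time(res_fast <- fast(D))
all.equal(res_slow, res_fast)
```

Timings depend on the R version and machine, so none are quoted here. The
loop in `fast` only touches an ordinary numeric vector, which is the case
the internal primitives handle.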
>>> In my opinion, this is an R design flaw.  Data frames are the
>>> fundamental unit of much statistical analysis, and should be fast.  I
>>> think R lacks any indexing into data frames.  Turning on indexing of
>>> data frames should at least be an optional feature.
>>> I hope this message post helps others.
>>> /iaw
>>> ----
>>> Ivo Welch (ivo.welch at gmail.com)
>>> http://www.ivo-welch.info/
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
