[R] Why are big data.frames slow? What can I do to get it faster?

Mon Oct 7 16:46:49 CEST 2002

Extracting from a data frame one element at a time, the way you did, is
expensive.  That is, test[i, 6] is much slower than test$whatever[i].

As an example:

> dat <- data.frame(a = sample(LETTERS, 1e6, replace=TRUE), b=1:1e6,
+                   c=rep("A", 1e6))
> dat$a <- as.character(dat$a)
> dat$c <- as.character(dat$c)
> 
> system.time(
+ for(i in 1:10) {
+   dat[i, 3] <- paste(dat[i, 1], "-", dat[i, 2], sep="")
+ }
+ )
[1] 26.17  0.13 26.67    NA    NA
> 
> system.time(
+ for(i in 1:10) {
+   dat$c[i] <- paste(dat$a[i], "-", dat$b[i], sep="")
+ }
+ )
[1] 0.16 0.00 0.16   NA   NA
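
For computations that are hard to vectorize, the same idea can be taken one
step further: pull the columns out as plain vectors, loop over those, and
write the result back to the data frame once.  What follows is only a sketch
under assumptions that are not part of the session above: a smaller n, the
stringsAsFactors argument in place of the as.character() calls, and the
helper names a, b, out, idx and vec.  No timings are claimed for it.

```r
# Small illustrative data frame; the column names a, b, c follow the example above.
n <- 1000
dat <- data.frame(a = sample(LETTERS, n, replace = TRUE),
                  b = seq_len(n),
                  c = rep("A", n),
                  stringsAsFactors = FALSE)

# Loop version: work on plain vectors, then assign the column back once.
a   <- dat$a
b   <- dat$b
out <- dat$c
for (i in 1:10) {
  out[i] <- paste(a[i], "-", b[i], sep = "")
}
dat$c <- out

# Vectorized version of the same computation, for comparison.
idx <- 1:10
vec <- paste(dat$a[idx], "-", dat$b[idx], sep = "")

# Both approaches should produce the same strings for the edited rows.
print(identical(dat$c[idx], vec))
```

The loop never indexes the data frame itself, so each iteration touches only
an ordinary character or integer vector.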

HTH,
Andy

> -----Original Message-----
> From: Marcus Jellinghaus [mailto:Marcus_Jellinghaus at gmx.de]
> Sent: Monday, October 07, 2002 7:09 AM
> To: Uwe Ligges
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] Why are big data.frames slow? What can I do to get it faster?
> 
> 
> First I want to say "thank you" to everybody who replied.
> I understand that vectorized operations are faster than the loop.
> I also made sure not to use factors.
> 
> Since the loop runs 100 times in my example, it should take at most the
> time of the vectorized operation multiplied by 100.
> But the loop takes about 10 minutes, while the vectorized operation takes
> about 3 seconds. (See below.)
> Why is that? Shouldn't the loop take at most 100 * 3 seconds = 5 minutes?
> 
> I'm interested in this because I expect to have some computations that
> are easily vectorizable (like this example) and others that are not
> vectorizable, or only with great difficulty.
> 
> Marcus Jellinghaus
> 
> 
> > print(dim(test)[1])
> [1] 500000
> > Sys.time()
> [1] "2002-10-07 06:17:33 Eastern Sommerzeit"
> > test[1:100,6] = paste(test[1:100,2],"-",test[1:100,3], sep = "")
> > Sys.time()
> [1] "2002-10-07 06:17:35 Eastern Sommerzeit"
> 
> [..]
> 
> > print(dim(test)[1])
> [1] 500000
> > Sys.time()
> [1] "2002-10-07 06:05:29 Eastern Sommerzeit"
> > for(i in 1:100) {
> +   test[i,6] = paste(test[i,2],"-",test[i,3], sep = "")
> + }
> > Sys.time()
> [1] "2002-10-07 06:15:17 Eastern Sommerzeit"
> 
> 
> -----Original Message-----
> From: Uwe Ligges [mailto:ligges at statistik.uni-dortmund.de]
> Sent: Sunday, October 06, 2002 1:58 PM
> To: Marcus Jellinghaus
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] Why are big data.frames slow? What can I do to get it faster?
> 
> 
> Marcus Jellinghaus wrote:
> >
> > Hello,
> >
> > I'm quite new to this list.
> > I have a high-frequency dataset with more than 500,000 records.
> > I want to edit a data.frame "test". My small program runs fine with a
> > small part of the dataset (just 100 records), but it is very slow with
> > a huge dataset. Of course it gets slower with more records, but when I
> > change just the size of the frame and keep the number of edited records
> > fixed, I see that it also gets slower.
> >
> > Here is my program:
> >
> > print(dim(test)[1])
> > Sys.time()
> > for(i in 1:100) {
> >   test[i,6] = paste(test[i,2],"-",test[i,3], sep = "")
> > }
> > Sys.time()
> >
> > I combine 2 currency symbols into a currency pair.
> > I always calculate only for the first 100 lines.
> > When I load just 100 lines into the data.frame "test", it takes 1 second.
> > When I load 1000 lines, editing 100 lines takes 2 seconds,
> > 10,000 lines loaded and 100 lines edited takes 5 seconds,
> > 100,000 lines loaded and 100 lines edited takes 31 seconds,
> > 500,000 lines loaded and 100 lines edited takes 11 minutes(!!!).
> >
> > My computer has 1 GB RAM, so that shouldn't be the reason.
> >
> > Of course, I could work with many small data.frames instead of one big
> > one, but the program above is just the very first step, so I don't want
> > to split.
> >
> > Is there a way to edit big data.frames without waiting a long time?
> 
> Well, the point is, I guess, that addressing elements in a large
> data.frame reasonably takes much more time than in a small one.
> 
> It may be an idea to use vectorized operations instead of the loop, which
> is preferable if your computation is easily vectorizable without a big
> penalty in memory consumption:
> 
>  test[1:100, 6] <- paste(test[1:100, 2], "-", test[1:100, 3], sep = "")
> or
>  test[ , 6] <- paste(test[ , 2], "-", test[ , 3], sep = "")
> for the whole data.frame.
> 
> Uwe Ligges
> 
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
> 
