[R] For loop gets exponentially slower as dataset gets larger...

bogdan romocea br44114 at gmail.com
Tue Jan 3 19:49:56 CET 2006


Your 2-million loop is overkill, because apparently in the (vast)
majority of cases you don't need to loop at all. You could try
something like this:
1. Split the price by id, e.g.
price.list <- split(price,id)
For each id,
2a. When price is not NA, assign it to next.price _without_ using a
for loop (after initializing next.price to all NA) - e.g.
next.price <- rep(NA, length(price))
next.price[!is.na(price)] <- price[!is.na(price)]
2b. Use a for loop only when price is NA, but even then work with
vectors as much as you can, for example (untested)
for (i in setdiff(which(is.na(price)), length(price))) {
	remaining.prices <- price[(i+1):length(price)]
	of.interest <- head(remaining.prices[!is.na(remaining.prices)], 1)
	# of.interest is empty when this id never trades again
	next.price[i] <- if (length(of.interest) == 0) NA else of.interest
	}
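
Steps (2a) and (2b) can be wrapped into one function that handles the prices of a single id. This is only a sketch: the function name next.price.one.id is made up for illustration, and it assumes price is a numeric vector already sorted by date.

```r
# Hypothetical helper (the name is not from this thread): next price for ONE id.
# Assumes 'price' is sorted by date and uses NA for days without a trade.
next.price.one.id <- function(price) {
  next.price <- rep(NA_real_, length(price))
  traded <- !is.na(price)
  # (2a) days with a trade keep their own price, no loop needed
  next.price[traded] <- price[traded]
  # (2b) loop only over the NA days; the last row has nothing after it
  for (i in setdiff(which(!traded), length(price))) {
    remaining.prices <- price[(i + 1):length(price)]
    of.interest <- head(remaining.prices[!is.na(remaining.prices)], 1)
    # leave NA in place when the id never trades again
    if (length(of.interest) > 0) next.price[i] <- of.interest
  }
  next.price
}

example <- next.price.one.id(c(453.3153, NA, 453.9155, 453.0152, NA, NA))
print(example)
```

The loop now runs once per missing price rather than once per row, so most of the 2 million rows never enter it.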
To run (2a) and (2b) for every id you could use lapply(); to paste the
bits together try do.call("rbind", ...) if the pieces are data frames, or
unsplit(..., id) if they are plain vectors as in step 1. You might also
want to take a look at ?Rprof and check the archives for efficiency
suggestions.
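
The split / lapply / paste-together pipeline might look like the sketch below. It swaps in a loop-free per-id step (a reversed cummin() over row positions) in place of the for loop above, so treat it as one possible variant rather than the method described in this message. The function name next.nonmissing and the toy vectors are assumptions; the prices are a few values borrowed from the sample data, with the NA placement partly invented.

```r
# Toy data in the layout described in the thread (sorted by id, then date).
id    <- c(1635, 1635, 1635, 1635, 1881, 1881, 1881)
price <- c(453.3153, NA, 453.9155, NA, NA, 66.1562, 64.9192)

# Hypothetical helper: for each position, the first non-NA price at or
# after it, or NA if there is none. No explicit loop.
next.nonmissing <- function(p) {
  n <- length(p)
  # position of each traded day; NA days point one past the end
  pos <- ifelse(!is.na(p), seq_len(n), n + 1L)
  # running minimum from the right = nearest traded position at or after i
  nxt <- rev(cummin(rev(pos)))
  # the extra NA at index n + 1 covers "never trades again"
  c(p, NA)[nxt]
}

price.list      <- split(price, id)                    # step 1
next.price.list <- lapply(price.list, next.nonmissing) # steps 2a/2b per id
next.price      <- unsplit(next.price.list, id)        # back to original row order

print(cbind(id, price, next.price))
```

Wrapping the lapply() call in system.time(), or profiling it with Rprof(), is a quick way to check how this scales before running it on the full data.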


> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of r user
> Sent: Tuesday, January 03, 2006 11:59 AM
> To: rhelp
> Subject: [R] For loop gets exponentially slower as dataset
> gets larger...
>
>
> I am running R 2.1.1 in a Microsoft Windows XP environment.
>
>   I have a matrix with three vectors ("columns") and ~2
> million "rows".  The three vectors are date_, id, and price.
> The data is ordered (sorted) by code and date_.
>
>   (The matrix contains daily prices for several thousand
> stocks, and has ~2 million "rows". If a stock did not trade
> on a particular date, its price is set to "NA")
>
>   I wish to add a fourth vector that is "next_price". ("Next
> price" is the current price as long as the current price is
> not "NA".  If the current price is NA, the "next_price" is
> the next price that the security with this same ID trades.
> If the stock does not trade again,  "next_price" is set to NA.)
>
>   I wrote the following loop to calculate next_price.  It
> works as intended, but I have one problem.  When I have only
> 10,000 rows of data, the calculations are very fast.
> However, when I run the loop on the full 2 million rows, it
> seems to take ~ 1 second per row.
>
>   Why is this happening?  What can I do to speed the
> calculations when running the loop on the full 2 million rows?
>
>   (I am not running low on memory, but I am maxing out my CPU at 100%)
>
>   Here is my code and some sample data:
>
>   data<- data[order(data$code,data$date_),]
>   l<-dim(data)[1]
>   w<-3
>   data[l,w+1]<-NA
>
>   for (i in (l-1):(1)){
>     data[i,w+1]<-ifelse(is.na(data[i,w])==F,data[i,w],
>       ifelse(data[i,2]==data[i+1,2],data[i+1,w+1],NA))
>   }
>
>
>   date      id         price     next_price
>   6/24/2005        1635    444.7838         444.7838
>   6/27/2005        1635    448.4756         448.4756
>   6/28/2005        1635    455.4161         455.4161
>   6/29/2005        1635    454.6658         454.6658
>   6/30/2005        1635    453.9155         453.9155
>   7/1/2005          1635    453.3153         453.3153
>   7/4/2005          1635    NA      453.9155
>   7/5/2005          1635    453.9155         453.9155
>   7/6/2005          1635    453.0152         453.0152
>   7/7/2005          1635    452.8651         452.8651
>   7/8/2005          1635    456.0163         456.0163
>   12/19/2005      1635    442.6982         442.6982
>   12/20/2005      1635    446.5159         446.5159
>   12/21/2005      1635    452.4714         452.4714
>   12/22/2005      1635    451.074           451.074
>   12/23/2005      1635    454.6453         454.6453
>   12/27/2005      1635    NA      NA
>   12/28/2005      1635    NA      NA
>   12/1/2003        1881    66.1562           66.1562
>   12/2/2003        1881    64.9192           64.9192
>   12/3/2003        1881    66.0078           66.0078
>   12/4/2003        1881    65.8098           65.8098
>   12/5/2003        1881    64.1275           64.1275
>   12/8/2003        1881    64.8697           64.8697
>   12/9/2003        1881    63.5337           63.5337
>   12/10/2003      1881    62.9399           62.9399