[R] practical to loop over 2million rows?
David Winsemius
dwinsemius at comcast.net
Wed Oct 10 23:16:52 CEST 2012
On Oct 10, 2012, at 1:31 PM, Jay Rice wrote:
> New to R and having issues with loops. I am aware that I should use
> vectorization whenever possible and use the apply functions, however,
> sometimes a loop seems necessary.
>
> I have a data set of 2 million rows and have tried to run a couple of loops of
> varying complexity to test efficiency. If I do a very simple loop such as
> add every item in a column I get an answer quickly.
>
> If I use a nested ifelse statement in a loop it takes me 13 minutes to get
> an answer on just 50,000 rows. I am aware of a few methods to speed up
> loops. Preallocating memory space and computing as much outside of the loop
> as possible (or creating functions and just looping over the function) but
> it seems that even with these speed ups I might have too much data to run
> loops. Here is the loop I ran that took 13 minutes. I realize I can
> accomplish the same goal using vectorization (and in fact did so).
You should describe what you want to do, learn to use the vectorized capabilities of R, and leave for-loops for the processes that really need them.
>
> y<-numeric(length(x))
>
> for(i in 1:length(x))
>
> ifelse(!is.na(x[i]), y[i]<-x[i],
Instead:
y[!is.na(x)] <- x[!is.na(x)] # No loop.
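
A minimal sketch of that idiom on made-up data (the values in x below are hypothetical, not the poster's data), computing the logical mask once and reusing it on both sides of the assignment:

```r
# Toy vector standing in for the poster's x (hypothetical values)
x <- c(5, NA, 7, NA, 9)
y <- numeric(length(x))      # preallocated, all zeros

ok <- !is.na(x)              # logical mask, computed once
y[ok] <- x[ok]               # copy only the non-missing values

y                            # positions where x is NA keep their initial 0
```

Storing the mask in ok avoids evaluating is.na(x) twice, which matters little here but adds up on 2 million rows.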
>
> ifelse(strataID[i+1]==strataID[i], y<-x[i+1], y<-x[i-1]))
When you index beyond the length of x you get NA as a result. Furthermore, y<-x[i+1] replaces the whole of y with a single element rather than setting y[i]. So I think 'y' will be a single NA at the end of all this.
> strataID <- sample(1:2, 10, repl=TRUE)
> strataID
[1] 1 1 2 2 1 2 2 2 2 1
> for(i in 1:length(x)) {ifelse(strataID[i+1]==strataID[i], y<-x[i+1], y<-x[i-1])}
> y
[1] NA
There is no implicit indexing of the LHS of an assignment operation. How long is strataID? And why not do this inside a dataframe?
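
For what it is worth, here is a sketch of how the whole loop might be vectorized. It rests on a guess about the intent: that a missing x[i] should be filled from the next row when that row is in the same stratum, and from the previous row otherwise. The data and the helper names (x_next, x_prev, id_next, same_next, fill) are invented for illustration.

```r
# Hypothetical data: x has gaps, strataID labels each row
x        <- c(1, NA, 3, 4, NA, 6)
strataID <- c(1, 1, 1, 2, 2, 2)
n <- length(x)

# Shifted copies, padded with NA so nothing indexes past either end
x_next  <- c(x[-1], NA)          # value in the following row
x_prev  <- c(NA, x[-n])          # value in the preceding row
id_next <- c(strataID[-1], NA)   # stratum of the following row

# TRUE where the following row exists and belongs to the same stratum
same_next <- !is.na(id_next) & id_next == strataID

# One vectorized ifelse() over all rows; the result is assigned as a whole
fill <- ifelse(same_next, x_next, x_prev)
y <- ifelse(is.na(x), fill, x)
y   # 1 3 3 4 6 6
```

If x and strataID were columns of a data frame, the same expressions could be wrapped in with() or within() so that the two vectors are guaranteed to have the same length.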
>
> Presumably, complicated loops would be more intensive than the nested if
> statement above. If I write more efficient loops time will come down but I
> wonder if I will ever be able to write efficient enough code to perform a
> complicated loop over 2 million rows in a reasonable time.
>
> Is it useless for me to try to do any complicated loops on 2 million rows,
> or if I get much better at programming in R will it be manageable even for
> complicated situations?
>
You will gain efficiency when you learn vectorization, and when you learn to test your code for correct behavior.
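
One way to do both at once, sketched on simulated data: write the loop and the vectorized version as small functions, confirm they agree, and only then compare timings. The function names loop_copy and vec_copy are hypothetical, and the 1e5-row vector is a stand-in for the real data.

```r
set.seed(1)                                  # reproducible toy data
x <- sample(c(1:10, NA), 1e5, replace = TRUE)

# Loop version: element by element, with an explicit index on the LHS
loop_copy <- function(x) {
  y <- numeric(length(x))
  for (i in seq_along(x)) {
    if (!is.na(x[i])) y[i] <- x[i]
  }
  y
}

# Vectorized version of the same operation
vec_copy <- function(x) {
  y <- numeric(length(x))
  ok <- !is.na(x)
  y[ok] <- x[ok]
  y
}

# Check correctness first, then compare timings
stopifnot(identical(loop_copy(x), vec_copy(x)))
system.time(loop_copy(x))
system.time(vec_copy(x))
```

Note the use of if() rather than ifelse() inside the loop: with a single element per pass, a plain if() is the appropriate construct, and ifelse() is better kept for whole vectors.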
>
> Jay
David Winsemius, MD
Alameda, CA, USA