[R] practical to loop over 2million rows?
S Ellison
S.Ellison at LGCGroup.com
Thu Oct 11 15:28:24 CEST 2012
> If I use a nested ifelse statement in a loop it takes me 13
> minutes to get an answer on just 50,000 rows.
> ...
> ifelse(strataID[i+1]==strataID[i], y<-x[i+1], y<-x[i-1]))
Maybe take a closer look at the ifelse help page and the examples?
First, ifelse is intended to be vectorized. If you nest it in a loop, you're effectively nesting a loop inside a loop. And by putting ifelse inside ifelse, you've done that twice. And then you've run the loops on vectors of length one, so 'twas all in vain...
Second, the two things after the condition in ifelse are not instructions, they are arguments to the function. Putting y<-something in as an argument means '(promise to) store something in a variable called y, and then pass y to the function'. You probably didn't mean that.
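A small sketch of that second point (the vector 'test' and the two strings are made up for illustration). Once the condition contains both TRUE and FALSE, both arguments get evaluated, so y is simply left holding whichever assignment ran last, while the per-element answer is in the return value:

```r
test <- c(TRUE, FALSE, TRUE)   #hypothetical condition vector
res <- ifelse(test, y <- "yes value", y <- "no value")
res   #the vector ifelse returns, one element per element of test
y     #a single string left behind as a side effect, not a per-element result
```

So the assignment inside the call never gives you one y per row; only the returned vector does.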
Third, ifelse returns a vector of the results; you're not using the return value for anything.
For a single 'if' that takes some action, you want 'if' and 'else' _separately_, not 'ifelse'.
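To make the distinction concrete, here is a minimal sketch; a, b and v are made-up values, not anything from the original data:

```r
a <- 3; b <- 5   #made-up scalars
#'if'/'else' tests ONE condition and runs one branch
if (a > b) bigger <- a else bigger <- b
#the same thing, using the value that 'if' returns
bigger2 <- if (a > b) a else b
#'ifelse' takes a whole vector of conditions and returns a vector
v <- c(1, 7, 4)
capped <- ifelse(v > 5, 5, v)
```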
So if you must do this with a loop, it would look more like
y <- numeric(length(x)) #allocate y at full length; length(x) on its own is a single number (and already numeric)
for(i in 2:(length(x)-1)) { #because x[i-1] and x[i+1] won't be there for all i otherwise. Note the parentheses: 1:length(x)+1 means 2:(length(x)+1)
   if (!is.na(x[i])) y[i] <- x[i]
   if (strataID[i+1]==strataID[i]) y[i] <- x[i+1] else y[i] <- x[i] #I changed the second x index because I can't see why it differed from the strataID index
   #or, using the fact that 'if' also returns something:
   # y[i] <- if(strataID[i+1]==strataID[i]) x[i+1] else x[i]
}
Finally, if you don't preallocate y at the length you want, R will have to move the whole of y to a new memory location with one more space every time you append something to it. There's a section on that in The R Inferno. Growing an object one element at a time is a really good way of slowing R down.
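If you want to see the effect for yourself, something like the following sketch would do. The function names grow() and prealloc() and the size n are made up for the comparison, and how big the gap is will depend on your R version and machine:

```r
n <- 20000   #kept small so the slow version still finishes quickly
grow <- function(n) {
  y <- numeric(0)              #starts empty
  for (i in 1:n) y[i] <- i^2   #y has to be extended on every pass
  y
}
prealloc <- function(n) {
  y <- numeric(n)              #full length allocated once, up front
  for (i in 1:n) y[i] <- i^2
  y
}
t_grow <- system.time(y1 <- grow(n))
t_pre  <- system.time(y2 <- prealloc(n))
identical(y1, y2)   #same answer either way; compare t_grow and t_pre
```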
So let's try something else.
strataID <- sample(letters[1:3], 2000000, replace=TRUE) #a nice long strata identifier with some matches likely
x <- rnorm(2000000) #some random numbers
x <- ifelse(x < -2, NA, x) #a few NA's now in x, though it does take a few seconds for the 2 million observations
i <- 1:(length(x)-1) #A long indexing vector with space for the last x[i+1]
y <- x #That puts all the NA's in the right place in y, allocates y and happens to put all the current values of x into y too.
system.time( y[i]<-ifelse( strataID[i+1]==strataID[i], x[i+1], x[i] ) )
#does the whole loop and stores it in the 'right' places in y -
# though it will foul up those NA's because of your x indexing. And incidentally it doesn't change the last y either
#On my allegedly 2GHz machine the system.time result was 2.87 seconds for the 2 million 'rows'
#Incidentally, a look at what we ended up with:
data.frame(s=strataID, y=y)[1:30,]
#says you probably aren't getting anything useful from the exercise other than a feel for what can go wrong with loops.
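If the aim was to fill in only the NA values of x from a neighbour in the same stratum (that is my guess from the quoted code, not something the original post states), a vectorised sketch might look like this. The short vectors s and x0 are made up so the result can be checked by eye:

```r
s  <- c("a", "a", "b", "b", "b", "c")   #made-up strata
x0 <- c(1, NA, 3, NA, 5, NA)            #made-up values with some NA's
i  <- 2:(length(x0) - 1)                #interior positions only, so i-1 and i+1 both exist
y0 <- x0                                #allocates y0 and keeps the non-NA values
#where x0 is NA, take the next value if the next row is in the same stratum,
#otherwise the previous value; where x0 is not NA, keep it
y0[i] <- ifelse(is.na(x0[i]),
                ifelse(s[i + 1] == s[i], x0[i + 1], x0[i - 1]),
                x0[i])
y0   #1 1 3 5 5 NA
```

The first and last elements are left untouched here, since they lack one of the two neighbours; they would need handling separately.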