[R] Thoughts for faster indexing
Steve Lianoglou
lianoglou.steve at gene.com
Tue Nov 26 21:23:51 CET 2013
Hi,
On Tue, Nov 26, 2013 at 11:41 AM, Noah Silverman <noahsilverman at ucla.edu> wrote:
> All interesting suggestions.
>
> I guess a better example of the code would have been a good idea. So,
> I'll put a relevant snippet here.
>
> Rows are cases. There are multiple cases for each ID, marked with a
> date. I'm trying to calculate a time recency weighted score for a
> covariate, added as a new column in the data.frame.
>
> So, for each row, I need to see which ID it belongs to, then get all the
> scores prior to this row's date, then compute the recency weighted summary.
>
> Right now, I do this in an obvious, but very very slow way.
>
> Here is my slow code:
> ======================
> for(i in 1:nrow(d)){
> for(j in which( d$id == d$id[i] & d$date[j] < d$date[i]) ){
> days_since = as.numeric( d$date[i] - d$date[j] )
> w <- exp( -days_since/decay )
> temp <- temp + w * as.numeric(d[j,'score'])
> wTemp <- wTemp + w
> }
>
> temp <- temp / wTemp
> d$newScore[i,] <- temp
> }
> ======================
>
> One immediate thought was to turn the "date" into an integer. That
> should save a few cycles of date math.
>
> I need to do this process for a bunch of scores. A grid search over
> different time decay levels might be nice. So any speedup to this
> routine will save me a ton of time.
>
> Ideas?
A few quick ones.
You had said you tried data.table and found it to be slow still -- my
guess is that you might not have used it correctly, so here is a rough
sketch of what to do.
Let's assume that your date is converted to some integer -- I will
leave that exercise to you :-) -- but it seems like you just want to
calculate the number of (whole) days since an event that you have a
record for, so this should be (in principle) easy to do (if you really
need the full power of "date math", data.table supports that as well).
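For what it's worth, a minimal sketch of that conversion, assuming `date`
is currently stored as a character string in ISO (yyyy-mm-dd) format. The
toy ids, dates and scores are made up for illustration:

```r
## Toy data; all values are made up for illustration
d <- data.frame(id    = c(1, 1, 2),
                date  = c("2013-01-01", "2013-01-11", "2013-02-01"),
                score = c(10, 20, 30),
                stringsAsFactors = FALSE)

## as.Date() parses the string; as.integer() then gives whole days
## since 1970-01-01
d$date <- as.integer(as.Date(d$date))

## Differences between entries are now plain integer day counts
d$date[2] - d$date[1]
```

After this, subtracting two dates is ordinary integer arithmetic with no
Date or difftime objects involved.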
Also you never "reset" your `temp` (or `wTemp`) variable, so it looks
like you are carrying over `temp` from one row, and from one `id` group,
to the next (and, while I have no knowledge of your problem, I would
imagine this is not what you want to do).
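To make the reset concrete, here is a minimal sketch of the loop with the
accumulators re-initialised for every row. It is still slow, and it rests
on a few assumptions of mine: `date` is already an integer day count,
`decay` is a plain number, and rows with no earlier record for their id
should get NA. The toy data are made up. Note that the condition compares
against the whole `d$date` column, since `j` is not defined yet at that
point:

```r
## Toy data; all values are made up for illustration
d <- data.frame(id    = c(1, 1, 1, 2),
                date  = c(0L, 10L, 20L, 5L),
                score = c(1, 2, 3, 4))
decay <- 10  # assumed value, purely for illustration

d$newScore <- NA_real_
for (i in seq_len(nrow(d))) {
  temp  <- 0  # reset for every row, so nothing carries over
  wTemp <- 0
  ## earlier records for the same id
  for (j in which(d$id == d$id[i] & d$date < d$date[i])) {
    w     <- exp(-(d$date[i] - d$date[j]) / decay)
    temp  <- temp + w * d$score[j]
    wTemp <- wTemp + w
  }
  ## rows with no earlier record keep NA (my assumption)
  if (wTemp > 0) d$newScore[i] <- temp / wTemp
}
```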
Anyway some rough ideas to get you started:
R> d <- as.data.table(d)
R> setkeyv(d, c('id', 'date'))
Now records within each id are ordered by date, from first to last.
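A small made-up example of what setting the key does to the row order
(the values are arbitrary, and `date` is assumed to be an integer
already):

```r
library(data.table)

## Toy data in scrambled order; all values are made up
d <- data.table(id    = c(2, 1, 2, 1),
                date  = c(5L, 20L, 1L, 10L),
                score = c(4, 2, 3, 1))
setkeyv(d, c('id', 'date'))

## Rows are now sorted by id, then by date within each id
d
```

Because the table is sorted by reference, anything computed "by id"
afterwards sees each id's records in date order.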
The specifics of your decay score escape me a bit, e.g. what is the
value of "days_since" for the first record of each id? I'll let you
figure that out, but in the non-edge cases, it looks like you can just
calculate "days since" by subtracting the date recorded in the previous
record from the current date. (Note that `.I` is a special data.table
variable holding the row numbers of the current group's records in the
original data.table):
d[, newScore := {
  ## Edge case: the first record w/in each `id` group has no earlier
  ## record, so its days_since is left as NA here -- adjust as needed
  days_since <- c(NA, date[-1] - d$date[.I[-1] - 1L])
  w <- exp(-days_since / decay)
  ## ...
  ## Some other stuff you are doing here which I can't
  ## understand with temp ... then multiply the 'score' column
  ## for the given row by your correctly calculated weight `w`
  ## for that row (whatever it might be).
  w * score
}, by='id']
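If what you are after is the weighted average over all earlier records
for the same id (which is how I read your loop), rather than just the gap
to the previous record, something along these lines might work. This is
only a sketch under my own assumptions: `recency_score` is a made-up
helper name, `date` is an integer day count, records with no earlier date
get NA, and since it builds an n-by-n matrix per id it assumes no single
id has a huge number of rows. The toy data are made up:

```r
library(data.table)

## Hypothetical helper: for each row, the exp-decay weighted mean of the
## scores on strictly earlier dates within one id; NA if there are none
recency_score <- function(date, score, decay) {
  gap <- outer(date, date, `-`)                # gap[i, j] = date[i] - date[j]
  w   <- ifelse(gap > 0, exp(-gap / decay), 0) # weight only earlier records
  num <- as.vector(w %*% score)                # weighted sum of scores
  den <- rowSums(w)                            # sum of weights
  ifelse(den > 0, num / den, NA_real_)
}

## Toy data; all values are made up for illustration
d <- data.table(id    = c(1, 1, 1, 2),
                date  = c(0L, 10L, 20L, 5L),
                score = c(1, 2, 3, 4))
decay <- 10  # assumed value

d[, newScore := recency_score(date, score, decay), by = 'id']
```

For a grid search, the same call can be repeated once per decay value,
writing each result to its own column.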
HTH,
-steve
--
Steve Lianoglou
Computational Biologist
Genentech