[R] Improving data processing efficiency
Patrick Burns
pburns at pburns.seanet.com
Fri Jun 6 17:44:26 CEST 2008
One change that is likely to speed up the code significantly
is to create 'result' at its final size and then subscript
into it, instead of growing it with rbind() on every
iteration. Something like:
result[i, ] <- bestpeer
(though I'm not sure if 'i' is the proper index).
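
A minimal sketch of that idea, using made-up toy data rather than
the poster's 'tfdata'. In the posted function 'i' restarts at 1
for every quarter, so a separate running counter is probably the
safer index. The names 'quarters', 'max_rows' and 'k' are
hypothetical, and the "no peer found" test is only a stand-in for
the real peer search:

```r
# Hypothetical toy data: two quarters, each a matrix of IPO rows.
quarters <- list(
  matrix(c(1, 2, 10, 20), nrow = 2),   # quarter 1: rows (1, 10) and (2, 20)
  matrix(c(3, 30), nrow = 1)           # quarter 2: row (3, 30)
)

# Upper bound on the number of result rows: at most one match per IPO row.
max_rows <- sum(sapply(quarters, nrow))

# Allocate 'result' once, at its maximum possible size.
result <- matrix(NA_real_, nrow = max_rows, ncol = 2)

k <- 0  # running row counter; it does not restart for each quarter
for (q in quarters) {
  for (i in seq_len(nrow(q))) {
    bestpeer <- q[i, ]            # stand-in for the real peer search
    if (bestpeer[1] == 2) next    # pretend no peer was found for this row
    k <- k + 1
    result[k, ] <- bestpeer       # fill in place instead of rbind()
  }
}

# Drop the rows that were never filled.
result <- result[seq_len(k), , drop = FALSE]
```

Allocating for the upper bound and trimming at the end avoids
having to know the exact number of matches in advance.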
Patrick Burns
patrick at burns-stat.com
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of S Poetry and "A Guide for the Unwilling S User")
Daniel Folkinshteyn wrote:
> Anybody have any thoughts on this? Please? :)
>
> on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:
>> Hi everyone!
>>
>> I have a question about data processing efficiency.
>>
>> My data are as follows: I have a data set on quarterly institutional
>> ownership of equities; some of them have had recent IPOs, some have
>> not (I have a binary flag set). The total dataset size is 700k+ rows.
>>
>> My goal is this: For every quarter since issue for each IPO, I need
>> to find a "matched" firm in the same industry, and close in market
>> cap. So, e.g., for firm X, which had an IPO, I need to find a matched
>> non-issuing firm in quarter 1 since IPO, then a (possibly different)
>> non-issuing firm in quarter 2 since IPO, etc. Repeat for each issuing
>> firm (there are about 8300 of these).
>>
>> Thus it seems to me that I need to be doing a lot of data selection
>> and subsetting, and looping (yikes!), but the result appears to be
>> highly inefficient and takes ages (well, many hours). What I am
>> doing, in pseudocode, is this:
>>
>> 1. for each quarter of data, getting out all the IPOs and all the
>> eligible non-issuing firms.
>> 2. for each IPO in a quarter, grab all the non-issuers in the same
>> industry, sort them by size, and finally grab a matching firm closest
>> in size (the exact procedure is to grab the closest bigger firm if
>> one exists, and just the biggest available if all are smaller)
>> 3. assign the matched firm-observation the same "quarters since
>> issue" as the IPO being matched
>> 4. rbind them all into the "matching" dataset.
>>
>> The function I currently have is pasted below, for your reference. Is
>> there any way to make it produce the same result but much faster?
>> Specifically, I am guessing eliminating some loops would be very
>> good, but I don't see how, since I need to do some fancy footwork for
>> each IPO in each quarter to find the matching firm. I'll be doing a
>> few things similar to this, so it's somewhat important to up the
>> efficiency of this. Maybe some of you R-fu masters can clue me in? :)
>>
>> I would appreciate any help, tips, tricks, tweaks, you name it! :)
>>
>> ========== my function below ===========
>>
>> fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata,
>>     quarters_since_issue=40) {
>>
>>   # rbind for matrix is cheaper, so typecast the result to matrix
>>   result = matrix(nrow=0, ncol=ncol(tfdata))
>>
>>   colnames = names(tfdata)
>>
>>   quarterends = sort(unique(tfdata$DATE))
>>
>>   for (aquarter in quarterends) {
>>     tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
>>
>>     tfdata_quarter_fitting_nonissuers = tfdata_quarter[
>>       (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
>>       (tfdata_quarter$IPO.Flag == 0), ]
>>     tfdata_quarter_ipoissuers = tfdata_quarter[
>>       tfdata_quarter$IPO.Flag == 1, ]
>>
>>     for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
>>       arow = tfdata_quarter_ipoissuers[i,]
>>       industrypeers = tfdata_quarter_fitting_nonissuers[
>>         tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
>>       industrypeers = industrypeers[
>>         order(industrypeers$Market.Cap.13f), ]
>>       if ( nrow(industrypeers) > 0 ) {
>>         if ( nrow(industrypeers[industrypeers$Market.Cap.13f
>>                   >= arow$Market.Cap.13f, ]) > 0 ) {
>>           bestpeer =
>>             industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1,]
>>         }
>>         else {
>>           bestpeer = industrypeers[nrow(industrypeers),]
>>         }
>>         bestpeer$Quarters.Since.IPO.Issue =
>>           arow$Quarters.Since.IPO.Issue
>>
>>         #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == bestpeer$PERMNO] = 1
>>         result = rbind(result, as.matrix(bestpeer))
>>       }
>>     }
>>     #result = rbind(result, tfdata_quarter)
>>     print (aquarter)
>>   }
>>
>>   result = as.data.frame(result)
>>   names(result) = colnames
>>   return(result)
>>
>> }
>>
>> ========= end of my function =============
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>