[R] Improving data processing efficiency

Thu Jun 5 20:09:23 CEST 2008

Hi everyone!

I have a question about data processing efficiency.

My data are as follows: I have a data set on quarterly institutional 
ownership of equities; some of them have had recent IPOs, some have not 
(I have a binary flag set). The total dataset size is 700k+ rows.

My goal is this: For every quarter since issue for each IPO, I need to 
find a "matched" firm in the same industry, and close in market cap. So, 
e.g., for firm X, which had an IPO, i need to find a matched non-issuing 
firm in quarter 1 since IPO, then a (possibly different) non-issuing 
firm in quarter 2 since IPO, etc. Repeat for each issuing firm (there 
are about 8300 of these).

Thus it seems to me that I need to be doing a lot of data selection and 
subsetting, and looping (yikes!), but the result appears to be highly 
inefficient and takes ages (well, many hours). What I am doing, in 
pseudocode, is this:

1. for each quarter of data, getting out all the IPOs and all the 
eligible non-issuing firms.
2. for each IPO in a quarter, grab all the non-issuers in the same 
industry, sort them by size, and finally grab a matching firm closest in 
size (the exact procedure is to grab the closest bigger firm if one 
exists, and just the biggest available if all are smaller)
3. assign the matched firm-observation the same "quarters since issue" 
as the IPO being matched
4. rbind them all into the "matching" dataset.

The function I currently have is pasted below, for your reference. Is 
there any way to make it produce the same result but much faster? 
Specifically, I am guessing eliminating some loops would be very good, 
but I don't see how, since I need to do some fancy footwork for each IPO 
in each quarter to find the matching firm. I'll be doing a few things 
similar to this, so it's somewhat important to up the efficiency of 
this. Maybe some of you R-fu masters can clue me in? :)

I would appreciate any help, tips, tricks, tweaks, you name it! :)

========== my function below ===========

fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata, 
quarters_since_issue=40) {

     result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix is 
cheaper, so typecast the result to matrix

     colnames = names(tfdata)

     quarterends = sort(unique(tfdata$DATE))

     for (aquarter in quarterends) {
         tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]

         tfdata_quarter_fitting_nonissuers = tfdata_quarter[ 
(tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) & 
(tfdata_quarter$IPO.Flag == 0), ]
         tfdata_quarter_ipoissuers = tfdata_quarter[ 
tfdata_quarter$IPO.Flag == 1, ]

         for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
             arow = tfdata_quarter_ipoissuers[i,]
             industrypeers = tfdata_quarter_fitting_nonissuers[ 
tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
             industrypeers = industrypeers[ 
order(industrypeers$Market.Cap.13f), ]
             if ( nrow(industrypeers) > 0 ) {
                 if ( nrow(industrypeers[industrypeers$Market.Cap.13f >= 
arow$Market.Cap.13f, ]) > 0 ) {
                     bestpeer = 
industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1,]
                 }
                 else {
                     bestpeer = industrypeers[nrow(industrypeers),]
                 }
                 bestpeer$Quarters.Since.IPO.Issue = 
arow$Quarters.Since.IPO.Issue

#tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == 
bestpeer$PERMNO] = 1
                 result = rbind(result, as.matrix(bestpeer))
             }
         }
         #result = rbind(result, tfdata_quarter)
         print (aquarter)
     }

     result = as.data.frame(result)
     names(result) = colnames
     return(result)

}

========= end of my function =============