[R] Improving data processing efficiency

Fri Jun 6 19:25:05 CEST 2008

i thought since the function code (which i provided in full) was pretty 
short, it would be reasonably easy to just read the code and see what 
it's doing.

but ok, so... i am attaching a zip file with a small sample of the data 
set (tab delimited) and the function code (the posting guidelines say 
that "some archive formats" are allowed; i assume zip is one of them).

would appreciate your comments! :)

on 06/06/2008 12:05 PM Gabor Grothendieck said the following:
> It's summarized in the last line of every message to r-help.  Note
> "reproducible" and "minimal".
> 
> On Fri, Jun 6, 2008 at 12:03 PM, Daniel Folkinshteyn <dfolkins at gmail.com> wrote:
>> i did! what did i miss?
>>
>> on 06/06/2008 11:45 AM Gabor Grothendieck said the following:
>>> Try reading the posting guide before posting.
>>>
>>> On Fri, Jun 6, 2008 at 11:12 AM, Daniel Folkinshteyn <dfolkins at gmail.com> wrote:
>>>> Anybody have any thoughts on this? Please? :)
>>>>
>>>> on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:
>>>>> Hi everyone!
>>>>>
>>>>> I have a question about data processing efficiency.
>>>>>
>>>>> My data are as follows: I have a data set on quarterly institutional
>>>>> ownership of equities; some of them have had recent IPOs, some have not
>>>>> (I have a binary flag set). The total dataset size is 700k+ rows.
>>>>>
>>>>> My goal is this: For every quarter since issue for each IPO, I need to
>>>>> find a "matched" firm in the same industry, and close in market cap. So,
>>>>> e.g., for firm X, which had an IPO, I need to find a matched non-issuing
>>>>> firm in quarter 1 since IPO, then a (possibly different) non-issuing
>>>>> firm in quarter 2 since IPO, etc. Repeat for each issuing firm (there
>>>>> are about 8300 of these).
>>>>>
>>>>> Thus it seems to me that I need to be doing a lot of data selection and
>>>>> subsetting, and looping (yikes!), but the result appears to be highly
>>>>> inefficient and takes ages (well, many hours). What I am doing, in
>>>>> pseudocode, is this:
>>>>>
>>>>> 1. for each quarter of data, get out all the IPOs and all the eligible
>>>>> non-issuing firms.
>>>>> 2. for each IPO in a quarter, grab all the non-issuers in the same
>>>>> industry, sort them by size, and finally grab a matching firm closest in
>>>>> size (the exact procedure is to grab the closest bigger firm if one
>>>>> exists, and just the biggest available if all are smaller).
>>>>> 3. assign the matched firm-observation the same "quarters since issue"
>>>>> as the IPO being matched.
>>>>> 4. rbind them all into the "matching" dataset.
>>>>>
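
The matching rule in step 2 above can be written as a small stand-alone
helper. This is only a minimal sketch, not the code from the original post:
pick_peer is a hypothetical name, and it assumes the market caps are numeric
with no missing values.

```r
# Hypothetical helper: given the market caps of the eligible peers in one
# industry-quarter and the market cap of one IPO firm, return the index of
# the matched peer: the smallest peer that is at least as big as the IPO
# firm, or the biggest peer if all of them are smaller.
pick_peer <- function(peer_caps, ipo_cap) {
  if (length(peer_caps) == 0) return(NA_integer_)  # no peers in this industry
  bigger <- which(peer_caps >= ipo_cap)
  if (length(bigger) > 0) {
    bigger[which.min(peer_caps[bigger])]           # closest peer from above
  } else {
    which.max(peer_caps)                           # all smaller: take the biggest
  }
}

pick_peer(c(50, 120, 80, 300), 100)  # index 2 (cap 120)
pick_peer(c(50, 120, 80, 300), 500)  # index 4 (cap 300)
```

No sorting is needed here, since which.min() and which.max() find the
candidate directly. Ties are broken by position, which may not agree exactly
with sorting first and then taking the first or last row.
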
>>>>> The function I currently have is pasted below, for your reference. Is
>>>>> there any way to make it produce the same result but much faster?
>>>>> Specifically, I am guessing eliminating some loops would be very good,
>>>>> but I don't see how, since I need to do some fancy footwork for each IPO
>>>>> in each quarter to find the matching firm. I'll be doing a few things
>>>>> similar to this, so it's somewhat important to up the efficiency of
>>>>> this. Maybe some of you R-fu masters can clue me in? :)
>>>>>
>>>>> I would appreciate any help, tips, tricks, tweaks, you name it! :)
>>>>>
>>>>> ========== my function below ===========
>>>>>
>>>>> fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata,
>>>>> quarters_since_issue=40) {
>>>>>
>>>>>   # rbind for matrix is cheaper, so typecast the result to matrix
>>>>>   result = matrix(nrow=0, ncol=ncol(tfdata))
>>>>>
>>>>>   colnames = names(tfdata)
>>>>>
>>>>>   quarterends = sort(unique(tfdata$DATE))
>>>>>
>>>>>   for (aquarter in quarterends) {
>>>>>       tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
>>>>>
>>>>>       tfdata_quarter_fitting_nonissuers = tfdata_quarter[
>>>>> (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
>>>>> (tfdata_quarter$IPO.Flag == 0), ]
>>>>>       tfdata_quarter_ipoissuers = tfdata_quarter[
>>>>> tfdata_quarter$IPO.Flag == 1, ]
>>>>>
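>>>>>       # seq_len() skips the loop when a quarter has no IPO rows
>>>>>       # (1:nrow(...) would iterate over c(1, 0) in that case)
>>>>>       for (i in seq_len(nrow(tfdata_quarter_ipoissuers))) {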
>>>>>           arow = tfdata_quarter_ipoissuers[i,]
>>>>>           industrypeers = tfdata_quarter_fitting_nonissuers[
>>>>> tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
>>>>>           industrypeers = industrypeers[
>>>>> order(industrypeers$Market.Cap.13f), ]
>>>>>           if ( nrow(industrypeers) > 0 ) {
>>>>>               if ( nrow(industrypeers[industrypeers$Market.Cap.13f >=
>>>>> arow$Market.Cap.13f, ]) > 0 ) {
>>>>>                   bestpeer = industrypeers[industrypeers$Market.Cap.13f >=
>>>>> arow$Market.Cap.13f, ][1,]
>>>>>               }
>>>>>               else {
>>>>>                   bestpeer = industrypeers[nrow(industrypeers),]
>>>>>               }
>>>>>               bestpeer$Quarters.Since.IPO.Issue =
>>>>> arow$Quarters.Since.IPO.Issue
>>>>>
>>>>>               #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO ==
>>>>>               #    bestpeer$PERMNO] = 1
>>>>>               result = rbind(result, as.matrix(bestpeer))
>>>>>           }
>>>>>       }
>>>>>       #result = rbind(result, tfdata_quarter)
>>>>>       print (aquarter)
>>>>>   }
>>>>>
>>>>>   result = as.data.frame(result)
>>>>>   names(result) = colnames
>>>>>   return(result)
>>>>>
>>>>> }
>>>>>
>>>>> ========= end of my function =============
>>>>>
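
One possible way to cut down the repeated work in the function above is to
group the row numbers by quarter and industry once, look peers up by key, and
build the result with a single subsetting step instead of growing it with
rbind(). This is a rough sketch under stated assumptions, not a drop-in
replacement: match_nonissuers and the toy data are hypothetical, the column
names are taken from the original post, and Market.Cap.13f is assumed to have
no missing values.

```r
# Sketch only: match_nonissuers is a hypothetical name; column names follow
# the original post; Market.Cap.13f is assumed to contain no NAs.
match_nonissuers <- function(tfdata, quarters_since_issue = 40) {
  ipo_rows <- which(tfdata$IPO.Flag == 1)
  peer_idx <- which(tfdata$IPO.Flag == 0 &
                    tfdata$Quarters.Since.Latest.Issue > quarters_since_issue)

  # One key per (quarter, industry); eligible peers are grouped once up
  # front instead of re-subsetting the data frame inside nested loops.
  key <- paste(tfdata$DATE, tfdata$HSICIG, sep = "|")
  peer_rows <- split(peer_idx, key[peer_idx])

  cap <- tfdata$Market.Cap.13f
  matched <- rep(NA_integer_, length(ipo_rows))
  for (j in seq_along(ipo_rows)) {
    i <- ipo_rows[j]
    peers <- peer_rows[[key[i]]]      # NULL if no eligible peer exists
    if (is.null(peers)) next
    bigger <- peers[cap[peers] >= cap[i]]
    matched[j] <- if (length(bigger) > 0) {
      bigger[which.min(cap[bigger])]  # closest bigger (or equal) peer
    } else {
      peers[which.max(cap[peers])]    # all smaller: take the biggest
    }
  }

  # Build the result in one step rather than rbind()-ing row by row.
  keep <- !is.na(matched)
  result <- tfdata[matched[keep], , drop = FALSE]
  result$Quarters.Since.IPO.Issue <-
    tfdata$Quarters.Since.IPO.Issue[ipo_rows[keep]]
  rownames(result) <- NULL
  result
}

# Made-up toy data: one IPO firm (PERMNO 1) and three eligible peers.
toy <- data.frame(
  DATE = rep("2000-03-31", 4),
  PERMNO = 1:4,
  HSICIG = c(10, 10, 10, 10),
  IPO.Flag = c(1, 0, 0, 0),
  Quarters.Since.Latest.Issue = c(1, 50, 50, 50),
  Quarters.Since.IPO.Issue = c(1, NA, NA, NA),
  Market.Cap.13f = c(100, 80, 120, 300)
)
match_nonissuers(toy)  # one row: PERMNO 3, Quarters.Since.IPO.Issue = 1
```

Two differences from the original function are worth keeping in mind: rows
come back in the order the IPO rows appear in tfdata rather than quarter by
quarter, and the columns keep their types because nothing is passed through
as.matrix(). Tie-breaking among peers with equal market cap may also differ.
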
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>