[R] Improving data processing efficiency

Patrick Burns pburns at pburns.seanet.com
Fri Jun 6 20:03:44 CEST 2008


That is going to be situation-dependent, but if you
have a reasonable upper bound, then that will be
much easier and not far from optimal.

If you pick the possibly-too-small route, then increasing
the size in largish chunks is much better than adding
a row at a time.
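
For instance, here is a minimal sketch of both routes (the sizes
and names are only illustrative):

    ## route 1: preallocate to an upper bound, fill by subscripting,
    ## then strip the unused rows at the end
    result <- matrix(NA_real_, nrow = 1000, ncol = 5)
    n <- 0
    for (k in 1:200) {
        n <- n + 1
        result[n, ] <- rnorm(5)
    }
    result <- result[seq_len(n), , drop = FALSE]

    ## route 2: if the initial guess proves too small, grow in
    ## largish chunks (here, doubling) rather than one row at a time
    grow_if_full <- function(m, n) {
        ## double the allocation once all rows are used
        if (n == nrow(m)) m <- rbind(m, matrix(NA_real_, nrow(m), ncol(m)))
        m
    }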

Pat

Daniel Folkinshteyn wrote:
> Thanks for the tip! I'll try that and see how big of a difference it 
> makes... If I am not sure what exactly the size will be, am I better 
> off making it larger and then later stripping off the blank rows, or 
> making it smaller and appending the missing rows?
>
> on 06/06/2008 11:44 AM Patrick Burns said the following:
>> One thing that is likely to speed the code significantly
>> is if you create 'result' to be its final size and then
>> subscript into it.  Something like:
>>
>>   result[i, ] <- bestpeer
>>
>> (though I'm not sure if 'i' is the proper index).
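>>
>> A rough, self-contained sketch of that pattern (the bound of
>> 100 rows and the runif() filler are just stand-ins):
>>
>>   result <- matrix(NA_real_, nrow = 100, ncol = 3)
>>   i <- 0
>>   for (k in 1:60) {
>>       i <- i + 1
>>       result[i, ] <- runif(3)      # stands in for bestpeer
>>   }
>>   result <- result[seq_len(i), ]   # drop any unfilled rows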
>>
>> Patrick Burns
>> patrick at burns-stat.com
>> +44 (0)20 8525 0696
>> http://www.burns-stat.com
>> (home of S Poetry and "A Guide for the Unwilling S User")
>>
>> Daniel Folkinshteyn wrote:
>>> Anybody have any thoughts on this? Please? :)
>>>
>>> on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:
>>>> Hi everyone!
>>>>
>>>> I have a question about data processing efficiency.
>>>>
>>>> My data are as follows: I have a data set on quarterly 
>>>> institutional ownership of equities; some of them have had recent 
>>>> IPOs, some have not (I have a binary flag set). The total dataset 
>>>> size is 700k+ rows.
>>>>
>>>> My goal is this: For every quarter since issue for each IPO, I need 
>>>> to find a "matched" firm in the same industry, and close in market 
>>>> cap. So, e.g., for firm X, which had an IPO, I need to find a 
>>>> matched non-issuing firm in quarter 1 since IPO, then a (possibly 
>>>> different) non-issuing firm in quarter 2 since IPO, etc. Repeat for 
>>>> each issuing firm (there are about 8300 of these).
>>>>
>>>> Thus it seems to me that I need to be doing a lot of data selection 
>>>> and subsetting, and looping (yikes!), but the resulting code is 
>>>> highly inefficient and takes ages (well, many hours). What I am 
>>>> doing, in pseudocode, is this:
>>>>
>>>> 1. for each quarter of data, getting out all the IPOs and all the 
>>>> eligible non-issuing firms.
>>>> 2. for each IPO in a quarter, grab all the non-issuers in the same 
>>>> industry, sort them by size, and finally grab a matching firm 
>>>> closest in size (the exact procedure is to grab the closest bigger 
>>>> firm if one exists, or just the biggest available if all are 
>>>> smaller; see the snippet after this list)
>>>> 3. assign the matched firm-observation the same "quarters since 
>>>> issue" as the IPO being matched
>>>> 4. rbind them all into the "matching" dataset.
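>>>>
>>>> (In code, the selection rule in step 2 is roughly this -- 'caps' and 
>>>> 'ipo_cap' are made-up names for illustration:)
>>>>
>>>>     caps <- sort(c(120, 45, 300, 80))    # toy peer market caps, sorted
>>>>     ipo_cap <- 100                       # the IPO firm's market cap
>>>>     at_least <- which(caps >= ipo_cap)   # peers at least as big
>>>>     pick <- if (length(at_least) > 0) at_least[1] else length(caps)
>>>>     caps[pick]                           # 120: the closest bigger peer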
>>>>
>>>> The function I currently have is pasted below, for your reference. 
>>>> Is there any way to make it produce the same result but much 
>>>> faster? Specifically, I am guessing eliminating some loops would be 
>>>> very good, but I don't see how, since I need to do some fancy 
>>>> footwork for each IPO in each quarter to find the matching firm. 
>>>> I'll be doing a few things similar to this, so it's somewhat 
>>>> important to up the efficiency of this. Maybe some of you R-fu 
>>>> masters can clue me in? :)
>>>>
>>>> I would appreciate any help, tips, tricks, tweaks, you name it! :)
>>>>
>>>> ========== my function below ===========
>>>>
>>>> fcn_create_nonissuing_match_by_quarterssinceissue <-
>>>>     function(tfdata, quarters_since_issue = 40) {
>>>>
>>>>     ## rbind for matrix is cheaper than for data frames,
>>>>     ## so accumulate the result as a matrix
>>>>     result <- matrix(nrow = 0, ncol = ncol(tfdata))
>>>>     colnames <- names(tfdata)
>>>>
>>>>     quarterends <- sort(unique(tfdata$DATE))
>>>>
>>>>     for (aquarter in quarterends) {
>>>>         tfdata_quarter <- tfdata[tfdata$DATE == aquarter, ]
>>>>
>>>>         ## candidate matches: seasoned firms that are not recent issuers
>>>>         tfdata_quarter_fitting_nonissuers <- tfdata_quarter[
>>>>             (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
>>>>             (tfdata_quarter$IPO.Flag == 0), ]
>>>>         tfdata_quarter_ipoissuers <-
>>>>             tfdata_quarter[tfdata_quarter$IPO.Flag == 1, ]
>>>>
>>>>         ## seq_len() avoids the 1:0 trap when a quarter has no IPOs
>>>>         for (i in seq_len(nrow(tfdata_quarter_ipoissuers))) {
>>>>             arow <- tfdata_quarter_ipoissuers[i, ]
>>>>
>>>>             ## non-issuers in the same industry, sorted by market cap
>>>>             industrypeers <- tfdata_quarter_fitting_nonissuers[
>>>>                 tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
>>>>             industrypeers <-
>>>>                 industrypeers[order(industrypeers$Market.Cap.13f), ]
>>>>
>>>>             if (nrow(industrypeers) > 0) {
>>>>                 ## closest bigger peer if one exists,
>>>>                 ## otherwise the biggest available peer
>>>>                 bigger <- industrypeers[
>>>>                     industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]
>>>>                 if (nrow(bigger) > 0) {
>>>>                     bestpeer <- bigger[1, ]
>>>>                 } else {
>>>>                     bestpeer <- industrypeers[nrow(industrypeers), ]
>>>>                 }
>>>>                 bestpeer$Quarters.Since.IPO.Issue <-
>>>>                     arow$Quarters.Since.IPO.Issue
>>>>                 # tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == bestpeer$PERMNO] <- 1
>>>>                 result <- rbind(result, as.matrix(bestpeer))
>>>>             }
>>>>         }
>>>>         # result <- rbind(result, tfdata_quarter)
>>>>         print(aquarter)
>>>>     }
>>>>
>>>>     result <- as.data.frame(result)
>>>>     names(result) <- colnames
>>>>     return(result)
>>>> }
>>>>
>>>> ========= end of my function =============
>>>>
>>>
>>
>


