[R] Improving data processing efficiency

Sat Jun 7 00:10:25 CEST 2008

Hmm... ok... so I ran the code twice for the first 20 quarters: once 
with a preallocated result, assigning rows to it, and once with an 
nrow=0 result, rbinding rows to it. There was no speedup. In fact, 
running with a preallocated result matrix was slightly slower than 
rbinding to the matrix:

for preallocated matrix:
Time difference of 1.577779 mins

for rbinding:
Time difference of 1.498628 mins

(The time difference only counts from the start of the loop until the 
end, so the time to allocate the empty matrix was /not/ included in the 
time count.)

So, it appears that rbinding a matrix is not the bottleneck. (That it 
was actually faster than assigning rows could have been a random 
anomaly, e.g. some other process eating a bit of CPU during the run, or 
not. At any rate, it doesn't make an /appreciable/ difference.)
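
A minimal sketch of this kind of comparison on toy data follows. It is 
not the actual matching function: grow_rbind, grow_prealloc, and the 
random rows matrix are made up for illustration, and timings will vary 
by machine. If the per-row work inside the real loop (subsetting and 
sorting data frames) dominates, a small or noisy difference like the 
one above would not be surprising.

```r
# Toy comparison, NOT the real matching function: n_rows, n_cols and the
# random rows are hypothetical stand-ins for the matched firm-observations.
set.seed(1)
n_rows <- 2000
n_cols <- 20
rows <- matrix(rnorm(n_rows * n_cols), nrow = n_rows)

# Variant 1: start with zero rows and grow the result with rbind()
grow_rbind <- function(rows) {
    result <- matrix(nrow = 0, ncol = ncol(rows))
    for (i in seq_len(nrow(rows))) {
        result <- rbind(result, rows[i, ])
    }
    result
}

# Variant 2: preallocate the result at its final size and assign rows into it
grow_prealloc <- function(rows) {
    result <- matrix(NA_real_, nrow = nrow(rows), ncol = ncol(rows))
    for (i in seq_len(nrow(rows))) {
        result[i, ] <- rows[i, ]
    }
    result
}

# Time only the loops, as in the runs reported above
time_rbind <- system.time(res_rbind <- grow_rbind(rows))
time_prealloc <- system.time(res_prealloc <- grow_prealloc(rows))

print(time_rbind["elapsed"])
print(time_prealloc["elapsed"])
```

Both variants should produce the same matrix, so any difference in 
elapsed time comes from how the result is built, not from what it 
contains.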

Any other suggestions? :)

on 06/06/2008 02:03 PM Patrick Burns said the following:
> That is going to be situation dependent, but if you
> have a reasonable upper bound, then that will be
> much easier and not far from optimal.
> 
> If you pick the possibly too small route, then increasing
> the size in largish chunks is much better than adding
> a row at a time.
> 
> Pat
> 
> Daniel Folkinshteyn wrote:
>> thanks for the tip! i'll try that and see how big of a difference that 
>> makes... if i am not sure what exactly the size will be, am i better 
>> off making it larger, and then later stripping off the blank rows, or 
>> making it smaller, and appending the missing rows?
>>
>> on 06/06/2008 11:44 AM Patrick Burns said the following:
>>> One thing that is likely to speed the code significantly
>>> is if you create 'result' to be its final size and then
>>> subscript into it.  Something like:
>>>
>>>   result[i, ] <- bestpeer
>>>
>>> (though I'm not sure if 'i' is the proper index).
>>>
>>> Patrick Burns
>>> patrick at burns-stat.com
>>> +44 (0)20 8525 0696
>>> http://www.burns-stat.com
>>> (home of S Poetry and "A Guide for the Unwilling S User")
>>>
>>> Daniel Folkinshteyn wrote:
>>>> Anybody have any thoughts on this? Please? :)
>>>>
>>>> on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:
>>>>> Hi everyone!
>>>>>
>>>>> I have a question about data processing efficiency.
>>>>>
>>>>> My data are as follows: I have a data set on quarterly 
>>>>> institutional ownership of equities; some of them have had recent 
>>>>> IPOs, some have not (I have a binary flag set). The total dataset 
>>>>> size is 700k+ rows.
>>>>>
>>>>> My goal is this: For every quarter since issue for each IPO, I need 
>>>>> to find a "matched" firm in the same industry, and close in market 
>>>>> cap. So, e.g., for firm X, which had an IPO, i need to find a 
>>>>> matched non-issuing firm in quarter 1 since IPO, then a (possibly 
>>>>> different) non-issuing firm in quarter 2 since IPO, etc. Repeat for 
>>>>> each issuing firm (there are about 8300 of these).
>>>>>
>>>>> Thus it seems to me that I need to be doing a lot of data selection 
>>>>> and subsetting, and looping (yikes!), but the result appears to be 
>>>>> highly inefficient and takes ages (well, many hours). What I am 
>>>>> doing, in pseudocode, is this:
>>>>>
>>>>> 1. for each quarter of data, getting out all the IPOs and all the 
>>>>> eligible non-issuing firms.
>>>>> 2. for each IPO in a quarter, grab all the non-issuers in the same 
>>>>> industry, sort them by size, and finally grab a matching firm 
>>>>> closest in size (the exact procedure is to grab the closest bigger 
>>>>> firm if one exists, and just the biggest available if all are smaller)
>>>>> 3. assign the matched firm-observation the same "quarters since 
>>>>> issue" as the IPO being matched
>>>>> 4. rbind them all into the "matching" dataset.
>>>>>
>>>>> The function I currently have is pasted below, for your reference. 
>>>>> Is there any way to make it produce the same result but much 
>>>>> faster? Specifically, I am guessing eliminating some loops would be 
>>>>> very good, but I don't see how, since I need to do some fancy 
>>>>> footwork for each IPO in each quarter to find the matching firm. 
>>>>> I'll be doing a few things similar to this, so it's somewhat 
>>>>> important to up the efficiency of this. Maybe some of you R-fu 
>>>>> masters can clue me in? :)
>>>>>
>>>>> I would appreciate any help, tips, tricks, tweaks, you name it! :)
>>>>>
>>>>> ========== my function below ===========
>>>>>
>>>>> fcn_create_nonissuing_match_by_quarterssinceissue =
>>>>>     function(tfdata, quarters_since_issue=40) {
>>>>>
>>>>>     # rbind for matrix is cheaper, so typecast the result to matrix
>>>>>     result = matrix(nrow=0, ncol=ncol(tfdata))
>>>>>
>>>>>     colnames = names(tfdata)
>>>>>
>>>>>     quarterends = sort(unique(tfdata$DATE))
>>>>>
>>>>>     for (aquarter in quarterends) {
>>>>>         tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
>>>>>
>>>>>         # eligible non-issuers and IPO issuers in this quarter
>>>>>         tfdata_quarter_fitting_nonissuers = tfdata_quarter[
>>>>>             (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
>>>>>             (tfdata_quarter$IPO.Flag == 0), ]
>>>>>         tfdata_quarter_ipoissuers =
>>>>>             tfdata_quarter[tfdata_quarter$IPO.Flag == 1, ]
>>>>>
>>>>>         # seq_len() skips quarters with no IPOs (1:0 would iterate twice)
>>>>>         for (i in seq_len(nrow(tfdata_quarter_ipoissuers))) {
>>>>>             arow = tfdata_quarter_ipoissuers[i,]
>>>>>             industrypeers = tfdata_quarter_fitting_nonissuers[
>>>>>                 tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
>>>>>             industrypeers =
>>>>>                 industrypeers[order(industrypeers$Market.Cap.13f), ]
>>>>>             if ( nrow(industrypeers) > 0 ) {
>>>>>                 # peers at least as big as the IPO firm, smallest first
>>>>>                 biggerpeers = industrypeers[
>>>>>                     industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]
>>>>>                 if ( nrow(biggerpeers) > 0 ) {
>>>>>                     bestpeer = biggerpeers[1,]
>>>>>                 } else {
>>>>>                     # all peers are smaller: take the biggest available
>>>>>                     bestpeer = industrypeers[nrow(industrypeers),]
>>>>>                 }
>>>>>                 bestpeer$Quarters.Since.IPO.Issue =
>>>>>                     arow$Quarters.Since.IPO.Issue
>>>>>
>>>>>                 #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == bestpeer$PERMNO] = 1
>>>>>                 result = rbind(result, as.matrix(bestpeer))
>>>>>             }
>>>>>         }
>>>>>         #result = rbind(result, tfdata_quarter)
>>>>>         print (aquarter)
>>>>>     }
>>>>>
>>>>>     result = as.data.frame(result)
>>>>>     names(result) = colnames
>>>>>     return(result)
>>>>>
>>>>> }
>>>>>
>>>>> ========= end of my function =============
>>>>>
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide 
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>
>>
>