[R] Improving data processing efficiency
Daniel Folkinshteyn
dfolkins at gmail.com
Sat Jun 7 00:59:02 CEST 2008
Thanks for the suggestions! I'll play with this over the weekend and see
what comes out. :)
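
One concrete thing I plan to test first: the rbind() inside my loop grows
the result object on every pass, which (as I understand it) copies the
whole thing each time. A small self-contained timing sketch (toy data,
made-up sizes) comparing that against collecting the pieces in a list and
rbinding once at the end:

n <- 2000
piece <- data.frame(x = 1, y = 2)

system.time({                 # grow inside the loop: copies res every pass
    res <- piece[0, ]
    for (i in 1:n) res <- rbind(res, piece)
})

system.time({                 # collect in a list, one rbind at the end
    pieces <- vector("list", n)
    for (i in 1:n) pieces[[i]] <- piece
    res2 <- do.call(rbind, pieces)
})
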
On 06/06/2008 06:48 PM, Don MacQueen said the following:
> In a case like this, if you can possibly work with matrices instead of
> data frames, you might get a significant speedup.
> (More accurately, I have had situations where I obtained a speedup by
> working with matrices instead of data frames.)
> Even if you have to code character columns as numeric, it can be worth it.
>
> Data frames have overhead that matrices do not. (Here's where profiling
> might have given a clue.) Granted, there has been recent work on reducing
> the overhead associated with data frames, but I think it's worth a try.
> Carrying along extra columns and doing row subsetting, rbinding, etc.,
> means a lot more work happening in memory.
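>
> A quick self-contained comparison (arbitrary toy sizes) shows the kind
> of difference I mean: the same row subsetting on a matrix and on a
> data frame holding identical data:
>
> m  <- matrix(rnorm(700000 * 4), ncol = 4)
> df <- as.data.frame(m)
> system.time(for (i in 1:20) x <- m[m[, 1] > 0, ])    # matrix subsetting
> system.time(for (i in 1:20) y <- df[df[, 1] > 0, ])  # data frame subsetting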
>
> So, for example, if all of your matching is based on just a few columns,
> extract those columns, convert them to a matrix, do all the matching,
> and then use some sort of row index to retrieve all of the associated
> columns, as in the sketch below.
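>
> Here's a self-contained sketch of that workflow (tiny made-up data and
> column names; the "matching" step is just a toy stand-in):
>
> full.df <- data.frame(ind.code = rep(1:3, each = 4),
>                       mkt.cap  = runif(12),
>                       other    = letters[1:12])
> key <- as.matrix(full.df[, c("ind.code", "mkt.cap")])  # numeric matrix only
>
> ## toy matching done entirely on the matrix: per industry group,
> ## take the row number with the largest market cap
> matched.idx <- tapply(seq_len(nrow(key)), key[, "ind.code"],
>                       function(r) r[which.max(key[r, "mkt.cap"])])
>
> full.df[matched.idx, ]  # the row index retrieves the associated columns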
>
> -Don
>
> At 2:09 PM -0400 6/5/08, Daniel Folkinshteyn wrote:
>> Hi everyone!
>>
>> I have a question about data processing efficiency.
>>
>> My data are as follows: I have a data set on quarterly institutional
>> ownership of equities; some of the firms have had recent IPOs and some
>> have not (I have a binary flag for this). The total dataset size is
>> 700k+ rows.
>>
>> My goal is this: for every quarter since issue, for each IPO, I need to
>> find a "matched" firm in the same industry that is close in market cap.
>> So, e.g., for firm X, which had an IPO, I need to find a matched
>> non-issuing firm in quarter 1 since IPO, then a (possibly different)
>> non-issuing firm in quarter 2 since IPO, etc. Repeat for each issuing
>> firm (there are about 8300 of these).
>>
>> Thus it seems that I need to do a lot of data selection, subsetting,
>> and looping (yikes!), and the result is highly inefficient: it takes
>> ages (well, many hours). What I am doing, in pseudocode, is this:
>>
>> 1. For each quarter of data, get all the IPOs and all the eligible
>> non-issuing firms.
>> 2. For each IPO in a quarter, grab all the non-issuers in the same
>> industry, sort them by size, and take the matching firm closest in
>> size (the exact rule: the closest bigger firm if one exists, otherwise
>> the biggest available; see the sketch right after this list).
>> 3. Assign the matched firm-observation the same "quarters since issue"
>> as the IPO being matched.
>> 4. rbind them all into the "matching" dataset.
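>>
>> To make the rule in step 2 concrete, here is the selection by itself as
>> a small stand-alone sketch (it assumes the peers' market caps are
>> already sorted ascending and that there is at least one peer):
>>
>> pick_peer_index <- function(peer.caps, ipo.cap) {
>>     bigger <- which(peer.caps >= ipo.cap)
>>     if (length(bigger) > 0) bigger[1]  # closest bigger firm
>>     else length(peer.caps)             # all smaller: take the biggest
>> }
>> pick_peer_index(c(1, 5, 10), 4)  # returns 2, the peer with cap 5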
>>
>> The function I currently have is pasted below, for your reference. Is
>> there any way to make it produce the same result but much faster?
>> Specifically, I am guessing that eliminating some loops would help a
>> lot, but I don't see how, since I need to do some fancy footwork for
>> each IPO in each quarter to find the matching firm. I'll be doing a few
>> similar things, so it's fairly important to improve the efficiency
>> here. Maybe some of you R-fu masters can clue me in? :)
>>
>> I would appreciate any help, tips, tricks, tweaks, you name it! :)
>>
>> ========== my function below ===========
>>
>> fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata,
>>         quarters_since_issue = 40) {
>>
>>     # rbind on a matrix is cheaper than on a data frame, so accumulate
>>     # the result as a matrix. Note: as.matrix() on a mixed data frame
>>     # coerces every column to character, so numeric columns need to be
>>     # re-converted after the final as.data.frame().
>>     result = matrix(nrow = 0, ncol = ncol(tfdata))
>>
>>     orig.names = names(tfdata)  # 'colnames' would shadow base::colnames
>>
>>     quarterends = sort(unique(tfdata$DATE))
>>
>>     for (aquarter in quarterends) {
>>         tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
>>
>>         # eligible matches: seasoned more than 'quarters_since_issue'
>>         # quarters and not themselves IPOs
>>         tfdata_quarter_fitting_nonissuers = tfdata_quarter[
>>             (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
>>             (tfdata_quarter$IPO.Flag == 0), ]
>>         tfdata_quarter_ipoissuers = tfdata_quarter[
>>             tfdata_quarter$IPO.Flag == 1, ]
>>
>>         # seq_len() gives an empty loop when there are no IPOs this
>>         # quarter; 1:nrow(...) would wrongly iterate over c(1, 0)
>>         for (i in seq_len(nrow(tfdata_quarter_ipoissuers))) {
>>             arow = tfdata_quarter_ipoissuers[i, ]
>>             industrypeers = tfdata_quarter_fitting_nonissuers[
>>                 tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
>>             industrypeers = industrypeers[order(industrypeers$Market.Cap.13f), ]
>>             if (nrow(industrypeers) > 0) {
>>                 # closest bigger peer if one exists, else biggest available
>>                 biggerpeers = industrypeers[
>>                     industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]
>>                 if (nrow(biggerpeers) > 0) {
>>                     bestpeer = biggerpeers[1, ]
>>                 } else {
>>                     bestpeer = industrypeers[nrow(industrypeers), ]
>>                 }
>>                 bestpeer$Quarters.Since.IPO.Issue = arow$Quarters.Since.IPO.Issue
>>                 # tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO ==
>>                 #     bestpeer$PERMNO] = 1
>>                 result = rbind(result, as.matrix(bestpeer))
>>             }
>>         }
>>         # result = rbind(result, tfdata_quarter)
>>         print(aquarter)  # progress indicator
>>     }
>>
>>     result = as.data.frame(result)
>>     names(result) = orig.names
>>     return(result)
>> }
>>
>> ========= end of my function =============
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.