[R] Improving data processing efficiency

Don MacQueen macq at llnl.gov
Sat Jun 7 00:48:10 CEST 2008


In a case like this, if you can possibly work with matrices instead 
of data frames, you might get a significant speedup.
(More accurately: I have had situations where I obtained a speedup 
by working with matrices instead of data frames.)
Even if you have to recode character columns as numeric, it can be 
worth it.
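
For example, something along these lines (an untested sketch; I'm 
borrowing the HSICIG industry column from your function):

     ## a factor assigns each distinct string an integer code; the
     ## codes can live in a numeric matrix, and levels() maps them
     ## back to the original strings afterwards
     ind <- factor(tfdata$HSICIG)
     ind.code <- as.integer(ind)
     ## recover the original values later with levels(ind)[ind.code]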

Data frames have overhead that matrices do not. (Here's where 
profiling might have given a clue.) Granted, there has been recent 
work on reducing the overhead associated with data frames, but I 
think it's worth a try. Carrying along extra columns and doing row 
subsetting, rbind()ing, etc., on a data frame means a lot more work 
happening in memory.
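
A quick and dirty timing, just to give the flavor (made-up data; 
Rprof() on your real function will show where the time actually 
goes):

     n <- 1e5
     df <- data.frame(a = rnorm(n), b = rnorm(n))
     m <- as.matrix(df)   # column names carry over to the matrix
     system.time(for (i in 1:100) df[df$a > 0, ])     # data frame rows
     system.time(for (i in 1:100) m[m[, "a"] > 0, ])  # matrix rows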

So, for example, if all of your matching is based on just a few 
columns, extract those columns, convert them to a matrix, do all the 
matching, and then use some sort of row index to retrieve all of 
the associated columns.
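
Untested, and borrowing column names from your function, but here is 
roughly what I mean:

     ## numeric matrix with just the columns the matching needs,
     ## plus the original row numbers
     key <- cbind(row = seq_len(nrow(tfdata)),
                  cap = tfdata$Market.Cap.13f,
                  ipo = tfdata$IPO.Flag)
     ## ... do all the matching on 'key', collecting row numbers ...
     matched.rows <- key[key[, "ipo"] == 0, "row"]  # stand-in for the real logic
     result <- tfdata[matched.rows, ]  # pull full rows out once, at the end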

-Don

At 2:09 PM -0400 6/5/08, Daniel Folkinshteyn wrote:
>Hi everyone!
>
>I have a question about data processing efficiency.
>
>My data are as follows: I have a data set on quarterly institutional 
>ownership of equities; some of them have had recent IPOs, some have 
>not (I have a binary flag set). The total dataset size is 700k+ rows.
>
>My goal is this: For every quarter since issue for each IPO, I need 
>to find a "matched" firm in the same industry that is close in 
>market cap. So, e.g., for firm X, which had an IPO, I need to find a 
>matched non-issuing firm in quarter 1 since IPO, then a (possibly 
>different) non-issuing firm in quarter 2 since IPO, etc. Repeat for 
>each issuing firm (there are about 8300 of these).
>
>Thus it seems to me that I need to be doing a lot of data selection 
>and subsetting, and looping (yikes!), but the result appears to be 
>highly inefficient and takes ages (well, many hours). What I am 
>doing, in pseudocode, is this:
>
>1. For each quarter of data, get all the IPOs and all the eligible 
>non-issuing firms.
>2. For each IPO in a quarter, grab all the non-issuers in the same 
>industry, sort them by size, and grab the matching firm closest in 
>size (the exact procedure is to grab the closest bigger firm if one 
>exists, and just the biggest available if all are smaller).
>3. Assign the matched firm-observation the same "quarters since 
>issue" as the IPO being matched.
>4. rbind them all into the "matching" dataset.
>
>The function I currently have is pasted below, for your reference. 
>Is there any way to make it produce the same result but much faster? 
>Specifically, I am guessing eliminating some loops would be very 
>good, but I don't see how, since I need to do some fancy footwork 
>for each IPO in each quarter to find the matching firm. I'll be 
>doing a few things similar to this, so it's somewhat important to up 
>the efficiency of this. Maybe some of you R-fu masters can clue me 
>in? :)
>
>I would appreciate any help, tips, tricks, tweaks, you name it! :)
>
>========== my function below ===========
>
>fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata,
>        quarters_since_issue = 40) {
>
>    # rbind for matrix is cheaper, so typecast the result to matrix
>    result = matrix(nrow = 0, ncol = ncol(tfdata))
>    colnames = names(tfdata)
>    quarterends = sort(unique(tfdata$DATE))
>
>    for (aquarter in quarterends) {
>        tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
>
>        # candidate matches: seasoned non-issuers in this quarter
>        tfdata_quarter_fitting_nonissuers = tfdata_quarter[
>            (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
>            (tfdata_quarter$IPO.Flag == 0), ]
>        tfdata_quarter_ipoissuers = tfdata_quarter[
>            tfdata_quarter$IPO.Flag == 1, ]
>
>        # seq_len() handles the zero-row case safely, unlike 1:nrow()
>        for (i in seq_len(nrow(tfdata_quarter_ipoissuers))) {
>            arow = tfdata_quarter_ipoissuers[i, ]
>            industrypeers = tfdata_quarter_fitting_nonissuers[
>                tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
>            industrypeers = industrypeers[order(industrypeers$Market.Cap.13f), ]
>            if (nrow(industrypeers) > 0) {
>                # closest bigger peer if one exists, else biggest available
>                bigger = industrypeers[
>                    industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]
>                if (nrow(bigger) > 0) {
>                    bestpeer = bigger[1, ]
>                } else {
>                    bestpeer = industrypeers[nrow(industrypeers), ]
>                }
>                bestpeer$Quarters.Since.IPO.Issue = arow$Quarters.Since.IPO.Issue
>                result = rbind(result, as.matrix(bestpeer))
>            }
>        }
>        print(aquarter)  # progress indicator
>    }
>
>    result = as.data.frame(result)
>    names(result) = colnames
>    return(result)
>}
>
>========= end of my function =============
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.


-- 
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062


