[R] efficiency in merging two data frames

Steve Miller steve.miller at jhu.edu
Mon May 1 14:54:54 CEST 2006


I'm sure you'll get ingenious responses to help you optimize your R code. I
deal with similar investment data in even larger numbers (e.g. 10 years of
daily return data for each stock in the Russell 3000), and prefer reading
and consolidating the data in Python using dictionaries and lists, then
either piping the data to R in a read statement (read.table("pipe
python...")) or using Rpy to write R data frames directly from Python.
Python is more facile with these basic data manipulations for hundreds of
thousands or even millions of records, and performance is generally
considerably better.

Steve Miller 

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Guojun Zhu
Sent: Monday, May 01, 2006 2:35 AM
To: r-help at stat.math.ethz.ch
Subject: [R] efficiency in merging two data frames

I have two data sets about lots of companies' stock
and fiscal data.  One is monthly data with about
144,000 lines, and the other is quaterly with about
56,000.  Each data set takes different company code. 
I need to merge these two together.  I read both ask
cvs.  And the other file with corresponding firm code.
 Now I have three data sets. return$PERMNO,
account$GVKEY.  id is the data frames of the
corresponding relation and has both id$PERMNO and
id$GVKEY.  Also, I need to convert the return's month
into quarter and finally merge two data frames(return
and account).  I end up write a short program for
this, but it runs very slow.  15+ minutes.  Is there
quick way to do it.  Here is my original codes.



id$fy=rep(0,length(id$PERMNO))
for (i in 1:length(id$PERMNO))

id$fy[[i]]<-account$FYR[id$GVKEY[[i]]==account$GVKEY][[1]]

return$GVKEY=rep(0,length(return$PERMNO))
return$fyy=rep(0,length(return$PERMNO))
return$fyq=rep(0,length(return$PERMNO))
for (i in i:length(return$PERMNO)) {
    temp<-id$PERMNO==return$PERMNO[[i]];
    tempmon<-id$fy[temp][[1]];
    if (return$month[[i]]<-tempmon) {
	return$fyy[[i]]<-return$year[[i]];
	return$fyq[[i]]<-4-(tempmon-return$month[[i]])%/%3;
	}
      else{
	return$fyy[[i]]<-return$year[[i]]+1;
	return$fyq[[i]]<-(return$month[[i]]-tempmon-1)%/%3;
	}
    return$GVKEY[[i]]<-id$GVKEY[temp][[1]];
}
   
returnnew=merge(return,account,by.x<-c("GVKEY","fyy","fyq"),by.y<-c("GVKEY",
"fyy","fyq"))

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html




More information about the R-help mailing list