[R] linear model coefficients by year and industry, fitted values, residuals, panel data
Peter Ehlers
ehlers at ucalgary.ca
Wed Apr 3 20:01:28 CEST 2013
A few minor improvements to Jean's post suggested inline below.
On 2013-04-03 05:41, Adams, Jean wrote:
> Cecilia,
>
> Thanks for providing a reproducible example. Excellent.
>
> You could use the ddply() function in the plyr package to fit the model for
> each industry and year, keep the coefficients, and then estimate the fitted
> and residual values.
>
> Jean
>
> library(plyr)
> coef <- ddply(final3, .(industry, year), function(dat) lm(Y ~ X + Z,
> data=dat)$coef)
> names(coef) <- c("industry", "year", "b0", "b1", "b2")
> final4 <- merge(final3, coef)
> newdata1 <- transform(final4, Yhat = b0 + b1*X + b2*Z)
> newdata2 <- transform(newdata1, residual = Y-Yhat)
> plot(as.factor(newdata2$firm), newdata2$residual)
Suggestion 1:
Use the extractor function coef(), and avoid using the name of an
R function as a variable name:

   Coef <- ddply(...., function(dat) coef(lm(....)))
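
Written out in full, Suggestion 1 might look like the sketch below. The ddply() arguments are taken from Jean's code above; the data frame 'panel' and its values are simulated here purely for illustration and are not Cecilia's data. Note that lm(...)$coef only works because `$` partially matches the 'coefficients' component, which is one more reason to prefer coef().

```r
library(plyr)

# Small simulated panel (hypothetical data, for illustration only)
set.seed(1)
panel <- expand.grid(firm = 1:6, year = 2000:2001, industry = c(20, 30))
panel$X <- rnorm(nrow(panel))
panel$Z <- rnorm(nrow(panel))
panel$Y <- 1 + 2 * panel$X - panel$Z + rnorm(nrow(panel))

# One regression per industry-year; coef() extracts the estimates
Coef <- ddply(panel, .(industry, year),
              function(dat) coef(lm(Y ~ X + Z, data = dat)))

# Rename the coefficient columns as in Jean's code
names(Coef) <- c("industry", "year", "b0", "b1", "b2")
Coef
```

The result has one row per industry-year combination, so it can be merged back onto the original data by 'industry' and 'year' as in Jean's merge() step.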
Suggestion 2:
Use plyr's mutate() to do both transforms at once:

   newdata <- mutate(final4,
                     Yhat = b0 + b1*X + b2*Z,
                     residual = Y - Yhat)

[Or you could use within(), but I now find mutate() handier, mainly
because it doesn't 'reverse' the order of the new variables.]
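
The reason mutate() can do this in one call, while Jean needed two transform() calls, is that mutate() evaluates its arguments one after another, so 'residual' can refer to the 'Yhat' just created. A minimal sketch, assuming a hypothetical data frame 'd' that already has the coefficients merged in (the numbers are made up):

```r
library(plyr)

# Hypothetical data with coefficients already merged in
set.seed(2)
d <- data.frame(X = rnorm(5), Z = rnorm(5), Y = rnorm(5),
                b0 = 0.5, b1 = 1, b2 = -1)

# mutate() evaluates its arguments in order, so residual can use Yhat
newdata <- mutate(d,
                  Yhat = b0 + b1 * X + b2 * Z,
                  residual = Y - Yhat)

# transform() evaluates all arguments against the original columns,
# so the same one-step call fails because Yhat does not exist yet
# (assuming no object called Yhat exists in the calling environment)
one_step <- try(transform(d,
                          Yhat = b0 + b1 * X + b2 * Z,
                          residual = Y - Yhat),
                silent = TRUE)
inherits(one_step, "try-error")
```

The new columns are appended in the order they were written, Yhat first and residual second.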
Suggestion 3:
Use the 'data=' argument in the plot:

   boxplot(residual ~ firm, data = newdata)
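
As a small self-contained illustration of the formula interface (the data frame here is simulated and hypothetical, standing in for the real 'newdata'):

```r
# Hypothetical residuals for four firms, five observations each
set.seed(3)
newdata <- data.frame(firm = rep(1:4, each = 5), residual = rnorm(20))

# Formula interface: one box per firm; both columns are looked up in 'data',
# so there is no need for newdata$... or as.factor()
bp <- boxplot(residual ~ firm, data = newdata,
              xlab = "firm", ylab = "residual")

# boxplot() invisibly returns the summary statistics it drew
bp$names
```

This draws the same kind of plot as plot(as.factor(newdata$firm), newdata$residual), since plot() on a factor and a numeric vector produces boxplots, but the formula version states the intent more directly.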
Peter Ehlers
>
> On Wed, Apr 3, 2013 at 3:38 AM, Cecilia Carmo <cecilia.carmo at ua.pt> wrote:
>
>> Hi R-helpers,
>>
>>
>>
>> My real data is a panel (unbalanced and with gaps in years) of thousands
>> of firms, by year and industry, with financial information (variables
>> X, Y, Z, for example). The number of firms by year and industry is not
>> always equal, and neither is the number of years by industry.
>>
>>
>>
>> #reproducible example
>> firm1<-sort(rep(1:10,5),decreasing=F)
>> year1<-rep(2000:2004,10)
>> industry1<-rep(20,50)
>> X<-rnorm(50)
>> Y<-rnorm(50)
>> Z<-rnorm(50)
>> data1<-data.frame(firm1,year1,industry1,X,Y,Z)
>> data1
>> colnames(data1)<-c("firm","year","industry","X","Y","Z")
>>
>>
>>
>> firm2<-sort(rep(11:15,3),decreasing=F)
>> year2<-rep(2001:2003,5)
>> industry2<-rep(30,15)
>> X<-rnorm(15)
>> Y<-rnorm(15)
>> Z<-rnorm(15)
>> data2<-data.frame(firm2,year2,industry2,X,Y,Z)
>> data2
>> colnames(data2)<-c("firm","year","industry","X","Y","Z")
>>
>> firm3<-sort(rep(16:20,4),decreasing=F)
>> year3<-rep(2001:2004,5)
>> industry3<-rep(40,20)
>> X<-rnorm(20)
>> Y<-rnorm(20)
>> Z<-rnorm(20)
>> data3<-data.frame(firm3,year3,industry3,X,Y,Z)
>> data3
>> colnames(data3)<-c("firm","year","industry","X","Y","Z")
>>
>>
>>
>> final1<-rbind(data1,data2)
>> final2<-rbind(final1,data3)
>> final2
>> final3<-final2[order(final2$industry,final2$year),]
>> final3
>>
>>
>>
>> I need to estimate a linear model Y = b0 + b1X + b2Z by industry and year,
>> to obtain the estimates of b0, b1 and b2 by industry and year (for example
>> I need to have the b0 for industry 20 and year 2000, for industry 20 and
>> year 2001...). Then I need to calculate the fitted values and the residuals
>> by firm, so I need to keep b0, b1 and b2 in a way that lets me do something
>> like
>> newdata1<-transform(final3,Yhat=b0+b1*X+b2*Z)
>> newdata2<-transform(newdata1,residual=Y-Yhat)
>> or another way to keep Yhat and the residuals in a dataframe with the
>> columns firm and year.
>>
>>
>>
>> Until now I have been doing this in a very hard way, and because I need to
>> do it several times, I need your help to find an easier way.
>>
>>
>>
>> Thank you,
>>
>>
>>
>> Cecília Carmo
>>
>> Universidade de Aveiro
>>
>> Portugal
>>