[R] Panel data handling (lags, growth rates)

Sun Aug 14 17:11:44 CEST 2005

On 8/14/05, Ajay Narottam Shah <ajayshah at mayin.org> wrote:
> I have written two functions which do useful things with panel data
> a.k.a. longitudinal data, where one unit of observation (a firm or a
> person or an animal) is observed on a uniform time grid:
>   - The first function makes lagged values of variables of your choice.
>   - The second function makes growth rates w.r.t. q observations ago,
>      for variables of your choice.
> 
> These strike me as bread-and-butter tasks in dealing with panel
> data. I couldn't find these functions in the standard R
> libraries. They are presented in this email for two reasons. First,
> it'll be great if R gurus can look at the code and propose
> improvements. Second, it'll be great if some package-owner can adopt
> these orphans :-) and make them available to the R community.
> 
> The two functions follow:
> 
> library(Hmisc)                          # Am using Lag() in this.
> 
> # Task: For a supplied list of variables (the list `lagvars'),
> #       make new columns in a dataset denoting lagged values.
> #       You must supply `unitvar' which identifies the unit that's
> #           repeatedly observed.
> #       You must supply the name of the time variable `timevar'
> #       and you must tell a list of the lags that interest you (`lags')
> # Example:
> #  paneldata.lags(A, "person", "year", c("v1","v2"), lags=1:4)
> paneldata.lags <- function(X, unitvar, timevar, lagvars, lags=1) {
>  stopifnot(length(lagvars)>=1)
>  X <- X[order(X[,timevar]),]           # just in case it's not sorted.
> 
>  innertask <- function(Y, lagvars, lags) {
>    E <- labels <- NULL
>    for (v in lagvars) {
>      for (i in lags) {
>        E <- cbind(E, Lag(Y[,v], i))
>      }
>      labels <- c(labels, paste(v, ".l", lags, sep=""))
>    }
>    colnames(E) <- labels
>    cbind(Y, E)
>  }
> 
>  do.call("rbind", by(X, X[,unitvar], innertask, lagvars, lags))
> }
> 
> # Task: For a supplied list of variables (the list `gvars'),
> #       make new columns in a dataset denoting growth rates.
> #       You must supply `unitvar' which identifies the unit that's
> #           repeatedly observed.
> #       You must supply the name of the time variable `timevar'
> #       and you must tell the time periods Q (vector is ok) over which
> #       the growth rates are computed.
> paneldata.growthrates <- function(X, unitvar, timevar, gvars, Q=1) {
>  stopifnot(length(gvars)>=1)
>  X <- X[order(X[,timevar]),]
> 
>  makegrowths <- function(x, q) {
>    new <- rep(NA, length(x))
>    for (t in (1+q):length(x)) {
>      new[t] <- 100*((x[t]/x[t-q])-1)
>    }
>    return(new)
>  }
> 
>  innertask <- function(Y, gvars, Q) {
>    E <- labels <- NULL
>    for (v in gvars) {
>      for (q in Q) {
>        E <- cbind(E, makegrowths(Y[,v], q))
>      }
>      labels <- c(labels, paste(v, ".g", Q, sep=""))
>    }
>    colnames(E) <- labels
>    cbind(Y, E)
>  }
> 
>  do.call("rbind", by(X, X[,unitvar], innertask, gvars, Q))
> }
> 
> Here's a demo of using them:
> 
> # A simple panel dataset
> A <- data.frame(year=rep(1980:1982,4),
>                person=factor(sort(rep(1:4,3))),
>                v1=round(rnorm(12),digits=2), v2=round(rnorm(12),digits=2))
> 
> # Demo of creating lags for both variables v1 and v2 --
> paneldata.lags(A, "person", "year", c("v1","v2"), lags=1:2)
> # Demo of creating growth rates for v2 w.r.t. 1 & 2 years ago --
> paneldata.growthrates(A, "person", "year", "v2", Q=1:2)
> 
> 
> 
> 
> Finally, I have a question. In a previous posting on this subject,
> Gabor showed me that my code:
> 
> # Blast this function for all the values that A$person takes --
> new <- NULL
> for (f in levels(A$person)) {
>  new <- rbind(new,
>               make.additional.variables(subset(A, A$person==f),
>                                         nlags=2, Q=1))
> }
> A <- new; rm(new)
> 
> can be replaced by one do.call() (as used above). It's awesome, but I
> don't understand it! :-) Could someone please explain how and why this
> works? I know by() and I see that when I do by(A,A$x), it gives me a
> list containing as many entries as are levels of A$x. I couldn't think
> of a way to force all this into one data frame; the best I could do
> was to do for (f in levels (A$person)) {} as shown here. The two
> functions above are using do.call() as Gabor used them, and they're
> awesome, but I don't understand why they work! The man page ?do.call
> was a bit too cryptic and I couldn't comprehend it. Where can I learn
> this stuff?
> 

Don't know of a source, I just study code, but 
conceptually by just splits up the rows by the grouping
argument giving a list of data frames and applies the
function to each element of the list giving the result.

For example, if we write:

f <- function(x) colSums(x[,-5])
iris.by <- by(iris, iris$Species, f)

is the same as:

f <- function(x) colSums(x[,-5])
iris.split <- split(iris, iris$Species)
iris.lapply <- lapply(iris.split, f)

except that in the by case the result gets a class of "by".

In either of the above cases the result is a list of these
three elements, i.e. these three data frames:

el1 <- iris.by[[1]]; el2 <- iris.by[[2]]; el3 <- iris.by[[3]]

Now, if g <- function(x,y)x+y then

	g(1,2)

is the same as

	do.call("g", list(1,2))

so going back the iris example, to rbind el1, el2 and el3 together we do this:

	rbind(el1, el2, el3)

which is the same as 

	do.call("rbind", list(e1, e2, e3))

which is the same as

	do.call("rbind", iris.by)