[R] Problems when using lag() in plm package

Tue Dec 29 18:08:07 CET 2009

Hi,

I've been trying out the plm package, which seems like a great boon to 
those who want to analyze panel data in R. I haven't started to use the 
estimation functions themselves - for now I am just interested in having 
a robust way to deal with lags in unbalanced panel data, since it is 
such a royal pain to deal with all the special cases.

However, in my tests I found behavior that seems strange at a minimum, 
and is potentially a bug, and I would like to understand it better. I 
demonstrate it with an example, which I include below.

Basically, I want the function to deal "correctly" with a panel that 
contains a unit (unit 1 in the example) with a gap, i.e. a missing entry 
for a particular point in time (year 4 in the example).

What the example demonstrates is that the outcome when the unit 1 
observations are lagged is different based on whether year 4 is present 
in the observations on *unit 2*.

If year 4 is present for unit 2, the lagged pseries is suitable for 
binding to the original data frame, with missing values at the correct 
locations. The names() for the lagged series are incorrect, but I don't 
really care about them. So this is basically the behavior I had hoped to 
see.
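
To make the behavior I am hoping for concrete, here is a minimal base R 
sketch. It is only an illustration under my own assumptions: gap_lag() is 
a hypothetical helper and not part of plm, the time index is assumed to 
be numeric with a step of one period, and each unit/time pair is assumed 
to occur at most once.

```r
# Hypothetical helper (not part of plm): lag x by one period within each
# unit, returning NA whenever the previous period is not observed.
gap_lag <- function(unit, time, x) {
  # Key identifying each observation by unit and period
  key_now  <- paste(unit, time, sep = "-")
  # Key of the observation one period earlier in the same unit
  key_prev <- paste(unit, time - 1, sep = "-")
  # match() yields NA when the earlier period is absent, and x[NA] is NA
  x[match(key_prev, key_now)]
}

# Same layout as my second example: unit 1 has a gap at year 4,
# and year 4 is not observed for unit 2 either
d <- data.frame(i = c(rep(1, 6), rep(2, 3)),
                t = c(1:3, 5:7, 1:3),
                x = 1:9)
d$x_lag <- gap_lag(d$i, d$t, d$x)
d
```

Here the lagged value for unit 1 at year 5 is NA regardless of which 
years are present for unit 2, which is the result I would expect from a 
gap-aware lag.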

However, if year 4 is not present in unit 2, the gap is not detected. A 
cbind() with the original series now produces incorrect data, although 
the names() of the lagged series could now be interpreted as correct in 
the strictest sense. If this is the expected behavior, it means that all 
estimation functions will have to examine the names() of each series, 
which seems like a lot of work.

My question then is: How should I interpret these results? Are gaps in 
the data disallowed? And if so, should the creation of a pdata.frame 
with gaps result in an error?
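
For reference, this is roughly the kind of check I have in mind for 
detecting gaps before building the panel. It is a sketch in base R only: 
find_gaps() is a hypothetical helper, not a plm function, and it assumes 
a numeric time index in which consecutive periods differ by exactly one.

```r
# Hypothetical helper (not part of plm): return the rows that come
# directly after a gap in their unit's time index.
find_gaps <- function(d, unit = "i", time = "t") {
  # Sort by unit and time so consecutive rows are consecutive periods
  d <- d[order(d[[unit]], d[[time]]), ]
  # Step between consecutive periods within each unit (NA for the first)
  step <- ave(d[[time]], d[[unit]], FUN = function(v) c(NA, diff(v)))
  d[!is.na(step) & step > 1, , drop = FALSE]
}

panel <- data.frame(i = c(rep(1, 6), rep(2, 3)),
                    t = c(1:3, 5:7, 1:3),
                    x = 1:9)
gaps <- find_gaps(panel)
gaps
```

For the data above this flags the single row for unit 1 at year 5, the 
first observation after the missing year 4.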

MY EXAMPLE FOLLOWS:

 > # Construct pdata.frame with gap in unit 1, at year 4
 > pdu <- pdata.frame(
+     data.frame(
+         i=c(rep(1,6),rep(2,3)),
+         t=c(1:3,5:7,2:4),
+         x=1:9
+     )
+ )
 > # Using cbind() to view the series with its lagged
 > # counterpart produces the expected result
 > cbind( pdu$i, pdu$t, pdu$x , lag(pdu$x))
     [,1] [,2] [,3] [,4]
1-1    1    1    1   NA
1-2    1    2    2    1
1-3    1    3    3    2
1-5    1    5    4   NA
1-6    1    6    5    4
1-7    1    7    6    5
2-2    2    2    7   NA
2-3    2    3    8    7
2-4    2    4    9    8
 > # The labels of the lagged series do not seem correct
 > # (the second NA should be labeled as 1-4, not as 1-3),
 > lag(pdu$x)
     1-1 1-2 1-3 1-5 1-6 1-7 2-2 2-3
  NA   1   2  NA   4   5  NA   7   8
attr(,"class")
[1] "integer"
 >
 > # Again, construct pdata.frame with (the same) gap in
 > # unit 1, but now that time period (4) is
 > # not present in unit 2 either
 > pdu <- pdata.frame(
+     data.frame(
+         i=c(rep(1,6),rep(2,3)),
+         t=c(1:3,5:7,1:3),
+         x=1:9
+     )
+ )
 > # Now the cbind() of the two series seems wrong
 > cbind( pdu$i, pdu$t, pdu$x , lag(pdu$x))
     [,1] [,2] [,3] [,4]
1-1    1    1    1   NA
1-2    1    2    2    1
1-3    1    3    3    2
1-5    1    4    4    3
1-6    1    5    5    4
1-7    1    6    6    5
2-1    2    1    7   NA
2-2    2    2    8    7
2-3    2    3    9    8
 > # But the labels of the lagged series could be
 > # interpreted as being correct in this case
 > lag(pdu$x)
     1-1 1-2 1-3 1-5 1-6 1-7 2-1 2-2
  NA   1   2   3   4   5  NA   7   8
attr(,"class")
[1] "integer"
 >

Best regards,
Magnus