[R] R: error using pvcm() on unbalanced panel data

Millo Giovanni Giovanni_Millo at Generali.com
Fri Feb 26 11:24:39 CET 2010

```Dear Liviu,

in general, pvcm is capable of fitting variable coefficients models on unbalanced data sets: e.g.,

> data(Grunfeld)
> grun<-Grunfeld[-1,] ## 'unbalance' it
> pvcm(inv~value+capital, data=grun)

Model Formula: inv ~ value + capital

Coefficients:
(Intercept)     value   capital
1   -193.77819 0.1272949 0.3765985
2    -49.19832 0.1748560 0.3896419
3     -9.95631 0.0265512 0.1516939
4     -6.18996 0.0779478 0.3157182
5     22.70712 0.1623777 0.0031017
6     -8.68554 0.1314548 0.0853743
7     -4.49953 0.0875272 0.1237814
8     -0.50939 0.0528941 0.0924065
9     -7.72284 0.0753879 0.0821036
10     0.16152 0.0045734 0.4373692

> pvcm(inv~value+capital, data=grun, model="random")

Model Formula: inv ~ value + capital

Coefficients:
(Intercept)       value     capital
-11.741754    0.085046    0.199632

so the problem must be within the dataset and the minimum-T requirements for fitting vc models. The error message may not be the friendliest, but it actually tells you where the problem is.
Let us look at the Hedonic data you used as an example: this is not a panel dataset but you can treat it as one, as you did, by grouping obs. for the same town. So your N is

> length(unique(Hedonic\$townid))
[1] 92

and your "T" (although here it is not "time") is:

> summary(tapply(Hedonic\$townid,Hedonic\$townid,length))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.0     2.0     4.0     5.5     7.0    30.0

as you can see, there are many "single observation" towns

> which(tapply(Hedonic\$townid,Hedonic\$townid,length)==1)
1 10 11 12 13 15 34 45 50 51 52 53 65 66 69 70 73
1 10 11 12 13 15 34 45 50 51 52 53 65 66 69 70 73

So the "within" model cannot work, as it needs T>(K+1) for estimating the separate regressions for each town. Why the "random" vcm doesn't is less straightforward.
Let us try a "reduced" pvcm with K=9 on a subset of data with T>10:

> hedo<-Hedonic[(Hedonic\$townid %in% which(tapply(Hedonic\$townid,Hedonic\$townid,length)>10)),]
> dim(hedo)
[1] 208  15
>  Hed <- pvcm(mv ~ crim + zn + indus + chas + nox + rm + age + dis +rad
+  , data=hedo, model = "within",index = "townid")

This works, but gives bad coefficients, because as it turns out there are also many time-invariant variables in the dataset, and of course these are discarded in doing timewise regressions!

A closer look at the data reveals that 'chas' is a factor, 'zn' is either 0 or 20, 'age' is truncated at 100 and so on. Let's see what's in the town with the most obs.:

> tapply(hedo\$townid,hedo\$townid,length)
5 25 28 29 39 41 46 60 80 81 83 84 85
22 11 15 30 11 18 12 12 11 13 19 23 11
> summary(hedo[hedo\$townid==29,])
mv              crim             zn        indus        chas
Min.   : 9.376   Min.   :1.127   Min.   :0   Min.   :19.58   no :23
1st Qu.: 9.655   1st Qu.:1.472   1st Qu.:0   1st Qu.:19.58   yes: 7
Median : 9.878   Median :2.152   Median :0   Median :19.58
Mean   : 9.973   Mean   :2.111   Mean   :0   Mean   :19.58
3rd Qu.:10.093   3rd Qu.:2.430   3rd Qu.:0   3rd Qu.:19.58
Max.   :10.820   Max.   :4.097   Max.   :0   Max.   :19.58
nox              rm             age              dis
Min.   :36.60   Min.   :24.04   Min.   : 79.20   Min.   :0.2788
1st Qu.:36.60   1st Qu.:30.26   1st Qu.: 93.82   1st Qu.:0.4349
Median :75.86   Median :35.69   Median : 96.05   Median :0.5615
Mean   :57.54   Mean   :37.83   Mean   : 95.16   Mean   :0.5876
3rd Qu.:75.86   3rd Qu.:39.71   3rd Qu.: 98.42   3rd Qu.:0.7354
Max.   :75.86   Max.   :70.14   Max.   :100.00   Max.   :0.8862
Min.   :1.609   Min.   :403   Min.   :14.7   Min.   :0.08801
1st Qu.:1.609   1st Qu.:403   1st Qu.:14.7   1st Qu.:0.29349
Median :1.609   Median :403   Median :14.7   Median :0.35000
Mean   :1.609   Mean   :403   Mean   :14.7   Mean   :0.31745
3rd Qu.:1.609   3rd Qu.:403   3rd Qu.:14.7   3rd Qu.:0.37402
Max.   :1.609   Max.   :403   Max.   :14.7   Max.   :0.39690
lstat            townid
Min.   :-4.058   Min.   :29
1st Qu.:-2.534   1st Qu.:29
Median :-2.064   Median :29
Mean   :-2.186   Mean   :29
3rd Qu.:-1.801   3rd Qu.:29
Max.   :-1.220   Max.   :29

whence we see that 'zn', 'indus', 'rad', 'tax', 'ptratio' and, of course, 'townid' are T-invariant, 'chas' is a factor but at least it varies between yes and no. 'nox' is also problematic in that it varies only from time to time...
A feasible formula, whatever this model means, is:

> fm <- mv ~ crim  + rm + age + dis + blacks + lstat
> newmodr <- pvcm(fm, data=hedo, model="random", index="townid")
> newmodw <- pvcm(fm, data=hedo, model="within", index="townid")
> ## all is well now

I hope hereby to have given you some methodological hint for a critical overview of your data. PS the pooltest() problem is much the same, as pooltest() needs to fit separate regressions.

Best,
Giovanni

-----Messaggio originale-----
Da: Liviu Andronic [mailto:landronimirc at gmail.com]
Inviato: giovedì 25 febbraio 2010 16:24
A: r-help at r-project.org Help
Cc: Millo Giovanni; yves.croissant at let.ish-lyon.cnrs.fr
Oggetto: error using pvcm() on unbalanced panel data

Dear all
I am trying to fit Variable Coefficients Models on Unbalanced Panel Data. I managed to fit such models on balanced panel data (the example from the "plm" vignette), but I failed to do so on my real, unbalanced panel data.

I can reproduce the error on a modified example from the vignette:
> require(plm)
> data("Hedonic")
> Hed <- pvcm(mv ~ crim + zn + indus + chas + nox + rm + age + dis +rad
> + tax + ptratio + blacks + lstat, Hedonic, model = "within",index =
> "townid")
Error in FUN(X[[1L]], ...) : insufficient number of observations
> ##it fails for both FE and RE cases
> Hed <- pvcm(mv ~ crim + zn + indus + chas + nox + rm + age + dis +rad
> + tax + ptratio + blacks + lstat, Hedonic, model = "random",index =
> "townid")
Error in FUN(X[[1L]], ...) : insufficient number of observations

Would this be expected behaviour for unbalanced data? The vignette warns of several limitations regarding such data, but doesn't mention this specific case as a limitation. I would like to subsequently perform  a test of poolability on my real data.

Thank you
Liviu

> sessionInfo()
R version 2.10.1 (2009-12-14)
x86_64-pc-linux-gnu

locale:
[1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=C              LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] plm_1.2-3        sandwich_2.2-5   zoo_1.6-2        MASS_7.3-5
[5] Formula_0.2-0    kinship_1.1.0-23 lattice_0.18-3   nlme_3.1-96
[9] survival_2.35-8  fortunes_1.3-7   RGtk2_2.12.18    cairoDevice_2.10
[13] sos_1.2-5        brew_1.0-3       hints_1.0.1-1

loaded via a namespace (and not attached):
[1] grid_2.10.1  tools_2.10.1

Ai sensi del D.Lgs. 196/2003 si precisa che le informazi...{{dropped:13}}

```