[R] R: error using pvcm() on unbalanced panel data

Fri Feb 26 11:24:39 CET 2010

Dear Liviu,

in general, pvcm is capable of fitting variable coefficients models on unbalanced data sets: e.g.,

> data(Grunfeld)
> grun<-Grunfeld[-1,] ## 'unbalance' it
> pvcm(inv~value+capital, data=grun)

Model Formula: inv ~ value + capital

Coefficients:
   (Intercept)     value   capital
1   -193.77819 0.1272949 0.3765985
2    -49.19832 0.1748560 0.3896419
3     -9.95631 0.0265512 0.1516939
4     -6.18996 0.0779478 0.3157182
5     22.70712 0.1623777 0.0031017
6     -8.68554 0.1314548 0.0853743
7     -4.49953 0.0875272 0.1237814
8     -0.50939 0.0528941 0.0924065
9     -7.72284 0.0753879 0.0821036
10     0.16152 0.0045734 0.4373692

> pvcm(inv~value+capital, data=grun, model="random")

Model Formula: inv ~ value + capital

Coefficients:
(Intercept)       value     capital 
 -11.741754    0.085046    0.199632 

so the problem must be within the dataset and the minimum-T requirements for fitting vc models. The error message may not be the friendliest, but it actually tells you where the problem is.
Let us look at the Hedonic data you used as an example: this is not a panel dataset but you can treat it as one, as you did, by grouping obs. for the same town. So your N is

> length(unique(Hedonic$townid))
[1] 92

and your "T" (although here it is not "time") is:

> summary(tapply(Hedonic$townid,Hedonic$townid,length))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     2.0     4.0     5.5     7.0    30.0 

as you can see, there are many "single observation" towns

> which(tapply(Hedonic$townid,Hedonic$townid,length)==1)
 1 10 11 12 13 15 34 45 50 51 52 53 65 66 69 70 73 
 1 10 11 12 13 15 34 45 50 51 52 53 65 66 69 70 73 

So the "within" model cannot work, as it needs T>(K+1) for estimating the separate regressions for each town. Why the "random" vcm doesn't is less straightforward.
Let us try a "reduced" pvcm with K=9 on a subset of data with T>10:

> hedo<-Hedonic[(Hedonic$townid %in% which(tapply(Hedonic$townid,Hedonic$townid,length)>10)),]
> dim(hedo)
[1] 208  15
>  Hed <- pvcm(mv ~ crim + zn + indus + chas + nox + rm + age + dis +rad 
+  , data=hedo, model = "within",index = "townid")

This works, but gives bad coefficients, because as it turns out there are also many time-invariant variables in the dataset, and of course these are discarded in doing timewise regressions!

A closer look at the data reveals that 'chas' is a factor, 'zn' is either 0 or 20, 'age' is truncated at 100 and so on. Let's see what's in the town with the most obs.:

> tapply(hedo$townid,hedo$townid,length)
 5 25 28 29 39 41 46 60 80 81 83 84 85 
22 11 15 30 11 18 12 12 11 13 19 23 11 
> summary(hedo[hedo$townid==29,])
       mv              crim             zn        indus        chas   
 Min.   : 9.376   Min.   :1.127   Min.   :0   Min.   :19.58   no :23  
 1st Qu.: 9.655   1st Qu.:1.472   1st Qu.:0   1st Qu.:19.58   yes: 7  
 Median : 9.878   Median :2.152   Median :0   Median :19.58           
 Mean   : 9.973   Mean   :2.111   Mean   :0   Mean   :19.58           
 3rd Qu.:10.093   3rd Qu.:2.430   3rd Qu.:0   3rd Qu.:19.58           
 Max.   :10.820   Max.   :4.097   Max.   :0   Max.   :19.58           
      nox              rm             age              dis        
 Min.   :36.60   Min.   :24.04   Min.   : 79.20   Min.   :0.2788  
 1st Qu.:36.60   1st Qu.:30.26   1st Qu.: 93.82   1st Qu.:0.4349  
 Median :75.86   Median :35.69   Median : 96.05   Median :0.5615  
 Mean   :57.54   Mean   :37.83   Mean   : 95.16   Mean   :0.5876  
 3rd Qu.:75.86   3rd Qu.:39.71   3rd Qu.: 98.42   3rd Qu.:0.7354  
 Max.   :75.86   Max.   :70.14   Max.   :100.00   Max.   :0.8862  
      rad             tax         ptratio         blacks       
 Min.   :1.609   Min.   :403   Min.   :14.7   Min.   :0.08801  
 1st Qu.:1.609   1st Qu.:403   1st Qu.:14.7   1st Qu.:0.29349  
 Median :1.609   Median :403   Median :14.7   Median :0.35000  
 Mean   :1.609   Mean   :403   Mean   :14.7   Mean   :0.31745  
 3rd Qu.:1.609   3rd Qu.:403   3rd Qu.:14.7   3rd Qu.:0.37402  
 Max.   :1.609   Max.   :403   Max.   :14.7   Max.   :0.39690  
     lstat            townid  
 Min.   :-4.058   Min.   :29  
 1st Qu.:-2.534   1st Qu.:29  
 Median :-2.064   Median :29  
 Mean   :-2.186   Mean   :29  
 3rd Qu.:-1.801   3rd Qu.:29  
 Max.   :-1.220   Max.   :29  

whence we see that 'zn', 'indus', 'rad', 'tax', 'ptratio' and, of course, 'townid' are T-invariant, 'chas' is a factor but at least it varies between yes and no. 'nox' is also problematic in that it varies only from time to time...
A feasible formula, whatever this model means, is:

> fm <- mv ~ crim  + rm + age + dis + blacks + lstat
> newmodr <- pvcm(fm, data=hedo, model="random", index="townid")
> newmodw <- pvcm(fm, data=hedo, model="within", index="townid")
> ## all is well now

I hope hereby to have given you some methodological hint for a critical overview of your data. PS the pooltest() problem is much the same, as pooltest() needs to fit separate regressions.

Best,
Giovanni

-----Messaggio originale-----
Da: Liviu Andronic [mailto:landronimirc at gmail.com] 
Inviato: giovedì 25 febbraio 2010 16:24
A: r-help at r-project.org Help
Cc: Millo Giovanni; yves.croissant at let.ish-lyon.cnrs.fr
Oggetto: error using pvcm() on unbalanced panel data

Dear all
I am trying to fit Variable Coefficients Models on Unbalanced Panel Data. I managed to fit such models on balanced panel data (the example from the "plm" vignette), but I failed to do so on my real, unbalanced panel data.

I can reproduce the error on a modified example from the vignette:
> require(plm)
> data("Hedonic")
> Hed <- pvcm(mv ~ crim + zn + indus + chas + nox + rm + age + dis +rad 
> + tax + ptratio + blacks + lstat, Hedonic, model = "within",index = 
> "townid")
Error in FUN(X[[1L]], ...) : insufficient number of observations
> ##it fails for both FE and RE cases
> Hed <- pvcm(mv ~ crim + zn + indus + chas + nox + rm + age + dis +rad 
> + tax + ptratio + blacks + lstat, Hedonic, model = "random",index = 
> "townid")
Error in FUN(X[[1L]], ...) : insufficient number of observations

Would this be expected behaviour for unbalanced data? The vignette warns of several limitations regarding such data, but doesn't mention this specific case as a limitation. I would like to subsequently perform  a test of poolability on my real data.

Thank you
Liviu

> sessionInfo()
R version 2.10.1 (2009-12-14)
x86_64-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] plm_1.2-3        sandwich_2.2-5   zoo_1.6-2        MASS_7.3-5
 [5] Formula_0.2-0    kinship_1.1.0-23 lattice_0.18-3   nlme_3.1-96
 [9] survival_2.35-8  fortunes_1.3-7   RGtk2_2.12.18    cairoDevice_2.10
[13] sos_1.2-5        brew_1.0-3       hints_1.0.1-1

loaded via a namespace (and not attached):
[1] grid_2.10.1  tools_2.10.1

Ai sensi del D.Lgs. 196/2003 si precisa che le informazi...{{dropped:13}}