[R] CORRECTION: Re: Multicollinearity with brglm?
Ioannis Kosmidis
I.Kosmidis at warwick.ac.uk
Thu Apr 2 16:56:00 CEST 2009
Thanks for your mail. I guess that the constant row sum on X would create
problems in a simulation framework because you might end up with linearly
dependent columns or even with columns of zeros (which I believe do not make
much sense).
First of all, I think there is a problem with your example below. For this X,
two columns should be eliminated if a constant is to be included in the model,
yet in summary(mod.simple.brglm) only one appears to be eliminated.
The reason for eliminating columns is merely to report a parameterization that
is identifiable.
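As a quick check (a minimal sketch, reconstructing the X from your message
below as a data frame of factors), the rank of the model matrix shows that two
of its columns are aliased:

x <- data.frame(X1 = factor(c(0, 0, 1, 1)),
                X2 = factor(c(1, 1, 0, 0)),
                X3 = factor(c(0, 1, 0, 1)),
                X4 = factor(c(1, 0, 1, 0)))
X <- model.matrix(~ X1 + X2 + X3 + X4, data = x)
ncol(X)      # 5 columns: the intercept plus one dummy per factor
qr(X)$rank   # 3, so two of the five columns must be eliminated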
For example, consider a single binomial variable with
2 observed successes out of 10 trials. Also, let's suppose that we
are interested in the log-odds of success, beta1. The
estimated log-odds for this sample is
hat{beta1} = log(0.2/0.8) = -1.386
so that the fitted probability is 0.2.
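In R, this is just an intercept-only fit (a small sketch of the above):

fit <- glm(cbind(2, 8) ~ 1, family = binomial)
coef(fit)     # (Intercept) -1.386294, i.e. log(0.2/0.8)
fitted(fit)   # 0.2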
If another constant, say beta2, is introduced in the model, then
there are infinitely many values that the vector (beta1, beta2) can take
which give fitted probability 0.2, namely any pair with
beta1 + beta2 = -1.386 (for example (-1, -0.386) or
(-10^8, 10^8 - 1.386)), and no choice is better than another. So glm chooses
to eliminate one of the two constants in order to get an identifiable
parameterization, in which each value of beta1 corresponds to
one and only one fitted probability.
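That elimination is exactly what the NA in your output reflects. As a sketch
(beta2.col here is just an illustrative second column of ones, not anything
from your data), forcing a duplicate constant into the fit gives:

beta2.col <- 1    # a second constant, duplicating the intercept
fit2 <- glm(cbind(2, 8) ~ 1 + beta2.col, family = binomial)
coef(fit2)        # (Intercept) -1.386294, beta2.col NA
                  # "not defined because of singularities"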
I hope this helps.
Best wishes,
Ioannis
On Thursday 02 April 2009 12:43:37 woodbomb wrote:
> Ioannis,
>
> Here's an illustrative example. Note that: glm also objects to X4; X1,..,X4
> are defined as factors.
>
> I've looked (albeit in a crude way) at various examples using the perturb
> package and it seems to confirm that X4 is the source of multicollinearity.
> As I say, I think the constant row-sum condition is the source of the
> problem, but I'm not sure why or how to deal with it.
>
> Thanks for your interest (and for the finite parameter estimates brglm
> provides)!
>
> >attributes(x)
>
> $names
> [1] "X1" "X2" "X3" "X4"
>
> $row.names
> [1] "2" "3" "4" "5"
>
> $class
> [1] "data.frame"
>
> >x
>
> X1 X2 X3 X4
> 2 0 1 0 1
> 3 0 1 1 0
> 4 1 0 0 1
> 5 1 0 1 0
>
> >attributes(y)
>
> $dim
> [1] 4 2
>
> $dimnames
> $dimnames[[1]]
> NULL
>
>
> $dimnames[[2]]
> [1] "s" "f"
>
> >y
>
> s f
> [1,] 3 7
> [2,] 2 8
> [3,] 5 5
> [4,] 3 7
>
> >summary(mod.simple)
>
> Call:
> brglm(formula = cbind(s, f) ~ X1 + X2 + X3 + X4, family = binomial,
> data = data)
>
>
> Coefficients: (1 not defined because of singularities)
>
> (Dispersion parameter for binomial family taken to be 1)
>
> Null deviance: 4.5797 on 5 degrees of freedom
> Residual deviance: 3.6469 on 2 degrees of freedom
> Penalized deviance: -1.79616
> AIC: 26.793
>
> >summary(mod.simple.brglm)
>
> Call:
> glm(formula = cbind(s, f) ~ X1 + X2 + X3 + X4, family = binomial,
> data = data)
>
> Deviance Residuals:
> 1 2 3 4 5 6
> 0.7103 -1.0256 0.3445 0.3760 -1.1876 0.6072
>
> Coefficients: (1 not defined because of singularities)
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) -1.356e+00 9.219e-01 -1.471 0.141
> X11 2.445e-01 7.003e-01 0.349 0.727
> X21 7.264e-01 7.048e-01 1.031 0.303
> X31 6.316e-14 6.959e-01 9.08e-14 1.000
> X41 NA NA NA NA
>
> (Dispersion parameter for binomial family taken to be 1)
>
> Null deviance: 5.0363 on 5 degrees of freedom
> Residual deviance: 3.5957 on 2 degrees of freedom
> AIC: 26.742
>
> Number of Fisher Scoring iterations: 4