[R] binomial glm for relevant feature selection?
ripley@stats.ox.ac.uk
Mon Nov 11 08:32:41 CET 2002
On Sun, 10 Nov 2002, Ben Liblit wrote:
> As suggested in my earlier message, I have a large population of
> independent variables and a binary dependent outcome. It is expected
> that only a few of the independent variables actually contribute to the
> outcome, and I'd like to find those.
>
> If it wasn't already obvious, I am *not* a statistician. Not even
> close. :-) Statistician colleagues have suggested that I use logistic
> regression for this problem. My understanding is that logistic
> regression is available in R as glm(..., family=binomial).
>
> When I use this solver on fictitious data, though, the answers I expect
> are not the answers I see. Consider the following fictitious data,
> where "z" is the dependent binary outcome, "y" is irrelevant noise, and
> "x" is actually relevant to predicting the outcome:
>
>   x y z
> 1 8 7 1
> 2 8 3 1
> 3 0 5 0
> 4 0 9 0
> 5 8 1 1
>
> If I feed this data to glm(z ~ x + y) using the default gaussian family,
> the results make some sense to me. The estimated coefficient for x is
> positive and the corresponding "Pr(>|t|)" value is tiny (<2e-16), which
> I take to imply a high degree of confidence that larger values of x
> correlate with increased likelihood of z. Conversely, the estimated
> coefficient for y has a "Pr(>|t|)" value of 0.552, which I take to imply
> that there is no strong correlation between y and z. Good.
>
> However, I've been told that I want to use family=binomial for a
> logistic regression problem with a binary dependent outcome like this.
> If I give this data to glm(z ~ x + y, family=binomial), the results
> become quite mysterious. I receive a warning that "Algorithm did not
> converge". The "Pr(>|t|)" values for x and y are 0.916 and 1.000
> respectively, which would seem to indicate that neither one correlates
> with the outcome.
>
> I realize that this is not a problem with R. It is a problem with my
> understanding of what R is doing. But you all have been so helpful thus
> far, perhaps I can impose on you to give me one more clue? What am I
> doing wrong here? What should I be looking at that I'm not?
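Your problem is linearly separable (z is 1 exactly when x is 8), so the
maximum-likelihood estimates are infinite and the fit cannot converge.
You are also seeing the Hauck-Donner effect: as an estimate moves far
from zero its standard error grows faster still, so the Wald statistic
shrinks towards zero. This is rare (but by no means unknown) in real
problems, and it means the Wald test behind the t values is unreliable.
More details are in Venables & Ripley (1999, 2002); look up Hauck-Donner
in the index. It's a technical point and the explanation is technical,
but there is also a practical summary there.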
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595