[R] logistic regression for a data set with perfect separation

David Firth david.firth at nuffield.oxford.ac.uk
Wed Sep 10 20:39:39 CEST 2003


On Wednesday, Sep 10, 2003, at 18:50 Europe/London, Christoph Lehmann 
wrote:

> Dear R experts
>
> I have the following data
>           V1 V2
> 1 -5.8000000  0
> 2 -4.8000000  0
> 3 -2.8666667  0
> 4 -0.8666667  0
> 5 -0.7333333  0
> 6 -1.6666667  0
> 7 -0.1333333  1
> 8  1.2000000  1
> 9  1.3333333  1
>
> and I want to know whether V1 can predict V2: of course it can, since
> there is perfect separation between cases 1..6 and 7..9
>
> How can I test whether this conclusion (being able to assign an
> observation i to class j knowing only its value on variable V1) also
> holds for the population our data were drawn from?

For this you really need more data.  The only way you'll ever be able 
to reject that hypothesis is by finding an instance of 010 or 101 in 
the (ordered by V1) sample.  And if you find such then you can reject 
with certainty.

>
> That is, which inference procedure is recommended? Logistic regression,
> for obvious reasons, makes no sense.

Not so obvious to me!  Logistic regression still makes sense, but care 
is needed in the method of estimation/inference.  The maximum 
likelihood solution in the above case is a model which says V2 is 1 
with certainty at some values of V1, and is zero with certainty at 
other values; and that seems an unwarranted inference with so little 
data.  That's a criticism of maximum likelihood, rather than a 
criticism of logistic regression.  (Think about the more extreme 
situation of tossing a coin once: if a head is observed, the ML 
solution is that the coin lands heads with certainty, i.e. that there 
is no chance of tails.)
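
A minimal sketch of this (my addition, not part of the original 
message): fitting the posted data by ordinary maximum likelihood with 
glm() shows the problem directly.

## The data from the original post
dat <- data.frame(
  V1 = c(-5.8, -4.8, -2.8666667, -0.8666667, -0.7333333,
         -1.6666667, -0.1333333, 1.2, 1.3333333),
  V2 = c(0, 0, 0, 0, 0, 0, 1, 1, 1)
)

## Ordinary (maximum-likelihood) logistic regression
fit <- glm(V2 ~ V1, family = binomial, data = dat)
summary(fit)
## With perfectly separated data the likelihood has no finite maximum:
## the V1 coefficient is driven towards infinity, the reported standard
## error is huge, and R warns that fitted probabilities numerically
## 0 or 1 occurred.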

There are alternative (Bayesian and pseudo-Bayesian) methods of 
inference which can yield more sensible answers in general.  [One such 
is implemented in package brlr ("bias reduced logistic regression") on 
CRAN.]  To "test" the hypothesis described above, though, with the data 
you have, would seem to require a fully Bayesian analysis whose 
conclusions would depend strongly on the prior probability attached to 
the hypothesis.  In other words, you need more data...
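
By way of illustration (a sketch of my own, assuming the brlr package 
is installed from CRAN and that its fitting function brlr() accepts a 
glm-style formula), using the data frame dat constructed in the sketch 
above:

library(brlr)
## Bias-reduced logistic regression; the bias reduction amounts to
## penalising the likelihood by the Jeffreys prior.
fit.br <- brlr(V2 ~ V1, data = dat)
summary(fit.br)
## Unlike the maximum-likelihood fit, this gives finite coefficient
## estimates even though V1 separates the two classes perfectly.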

I hope that helps in some way!

Regards,
David


>
> Many thanks for your help
>
> Christoph
> -- 
> Christoph Lehmann <christoph.lehmann at gmx.ch>
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
