[R] OT: (quasi-?) separation in a logistic GLM

Tue Dec 16 03:10:57 CET 2008

If you look at the distribution of those with analogs==TRUE versus the  
whole groups it is not surprising that the upper range of Dij's result  
in a very low probability estimate:

  plot(density(dat$Dij))
lines(density(dat[dat$analogs == TRUE, 2]))

Appears as though more than 25% of the predict(mod, type="response")  
elements will satisfy
all.equal(0, <predict>) which on my machine occurs when <predict> is  
(roughly)  below 1e-9

Best;

David Winsemius

On Dec 15, 2008, at 1:03 PM, Gavin Simpson wrote:

> Dear List,
>
> Apologies for this off-topic post but it is R-related in the sense  
> that
> I am trying to understand what R is telling me with the data to hand.
>
> ROC curves have recently been used to determine a dissimilarity
> threshold for identifying whether two samples are from the same "type"
> or not. Given the bashing that ROC curves get whenever anyone asks  
> about
> them on this list (and having implemented the ROC methodology in my
> analogue package) I wanted to try directly modelling the probability
> that two sites are analogues for one another for given dissimilarity
> using glm().
>
> The data I have then are a logical vector ('analogs') indicating  
> whether
> the two sites come from the same vegetation and a vector of the
> dissimilarity between the two sites ('Dij'). These are in a csv file
> currently in my university web space. Each 'row' in this file
> corresponds to single comparison between 2 sites.
>
> When I analyse these data using glm() I get the familiar "fitted
> probabilities numerically 0 or 1 occurred" warning. The data do not  
> look
> linearly separable when plotted (code for which is below). I have read
> Venables and Ripley's discussion of this in MASS4 and other sources  
> that
> discuss this warning and R (Faraway's Extending the Linear Model  
> with R
> and John Fox's new Applied Regression, Generalized Linear Models, and
> Related Methods, 2nd Ed) as well as some of the literature on Firth's
> bias reduction method. But I am still somewhat unsure what
> (quasi-)separation is and if this is the reason for the warnings in  
> this
> case.
>
> My question then is, is this a separation issue with my data, or is it
> quasi-separation that I have read a bit about whilst researching this
> problem? Or is this something completely different?
>
> Code to reproduce my problem with the actual data is given below. I'd
> appreciate any comments or thoughts on this.
>
> #### Begin code snippet  
> ################################################
>
> ## note data file is ~93Kb in size
> dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/ 
> dat.csv"))
> head(dat)
> ## fit model --- produces warning
> mod <- glm(analogs ~ Dij, data = dat, family = binomial)
> ## plot the data
> plot(analogs ~ Dij, data = dat)
> fit.mod <- fitted(mod)
> ord <- with(dat, order(Dij))
> with(dat, lines(Dij[ord], fit.mod[ord], col = "red", lwd = 2))
>
> #### End code snippet  
> ##################################################
>
> Thanks in advance
>
> Gavin
> -- 
> %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
> ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
> Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
> Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
> UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
> %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.