[R] OT: (quasi-?) separation in a logistic GLM

Wed Dec 17 17:44:28 CET 2008

Gavin:

I think it's important to point out two probably obvious things.
(1) The dataset is very imbalanced: we have an overabundance of 
'analogs==FALSE', roughly 94% of the data.  If we think of a purely 
non-parametric test of equality of the underlying CDFs (F0 for the 
non-analogs, F1 for the analogs), then we have a lot of confidence in F0 
and not much in F1.

(2) This aside, the Dij value 0.209 does appear to be the optimum from the 
standpoint of maximizing Youden's index, Se + Sp - 1, which is the expected 
utility when assigning utilities +/- 1/pi and +/- 1/(1-pi) to true/false 
positives and negatives, where pi is the proportion of analogs (about 0.06 
here). In this case a true/false positive is worth 0.94/0.06 times the 
value of a true/false negative, which seems reasonable given the imbalance 
of the dataset and the expectation that the measurements are equally precise 
in the two populations.
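
To make the utility reading concrete, here is a minimal numerical sketch. The exact assignment of signs to the four cells is my assumption: +1/pi for a true positive, -1/pi for a false negative, +1/(1-pi) for a true negative and -1/(1-pi) for a false positive, with pi = 0.06 taken as a round figure for the proportion of analogs. Under that assignment the expected utility comes out as twice Youden's index, so both are maximized at the same cutpoint. The function name expected_utility is made up for this illustration.

```r
# Expected utility of a classifier with sensitivity se and specificity sp
# when a proportion p of the cases are positives.
# ASSUMPTION: utilities are +/- 1/p for decisions on actual positives and
# +/- 1/(1 - p) for decisions on actual negatives.
expected_utility <- function(se, sp, p) {
  u_tp <-  1 / p          # true positive
  u_fn <- -1 / p          # false negative
  u_tn <-  1 / (1 - p)    # true negative
  u_fp <- -1 / (1 - p)    # false positive
  p * se * u_tp + p * (1 - se) * u_fn +
    (1 - p) * sp * u_tn + (1 - p) * (1 - sp) * u_fp
}

# se and sp as reported in this message for the cutpoint 0.209
se <- 0.9269231
sp <- 0.9443561
eu <- expected_utility(se, sp, p = 0.06)
youden <- se + sp - 1

c(expected_utility = eu, twice_youden = 2 * youden)
all.equal(eu, 2 * youden)
```

The prevalence cancels out of the result, which is why Youden's index does not depend on how imbalanced the sample is.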

x <- 0.209
with(dat, c(sp <- mean(Dij[!analogs] > x), se <- mean(Dij[analogs] <= x),
            sp + se - 1))
[1] 0.9443561 0.9269231 0.8712792
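
The cutpoint that maximizes Youden's index can be found by a brute-force scan over the observed dissimilarities. The sketch below does this on simulated data (a made-up stand-in with about 6% analogs, not dat.csv), so the numbers it prints are not the ones above. The function name youden and the Beta distributions are assumptions for illustration only.

```r
# Simulated stand-in for the real data: about 6% analogs, and analogs
# tend to have smaller dissimilarities. NOT the values in dat.csv.
set.seed(1)
n <- 5000
analogs <- runif(n) < 0.06
Dij <- ifelse(analogs, rbeta(n, 2, 12), rbeta(n, 6, 6))

# Youden's index at cutpoint x: call a pair an analog when Dij <= x
youden <- function(x) {
  sp <- mean(Dij[!analogs] > x)   # specificity
  se <- mean(Dij[analogs] <= x)   # sensitivity
  sp + se - 1
}

# Evaluate the index at every observed dissimilarity and keep the best
cuts <- sort(unique(Dij))
J <- vapply(cuts, youden, numeric(1))
best <- cuts[which.max(J)]

c(cutpoint = best, youden = max(J))
```

With the real data, replacing the simulated vectors by dat$analogs and dat$Dij should give the same kind of scan.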

So it appears that the dataset is quite well separated into two samples at the 
cutpoint 0.209.
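
Note that "well separated" here is not separation in the glm() sense. Since both se and sp are below 1 at this cutpoint, some analogs lie above 0.209 and some non-analogs lie at or below it, so the two groups overlap. A rough sketch of how one might check this directly for a single predictor is below. The function name separation_type is hypothetical and the toy vectors are invented.

```r
# Classify the kind of separation for one numeric predictor x and a
# logical response y, as it would matter for glm(y ~ x, family = binomial).
separation_type <- function(x, y) {
  stopifnot(is.numeric(x), is.logical(y), any(y), any(!y))
  gap_low  <- min(x[!y]) - max(x[y])   # > 0: every TRUE case lies below every FALSE case
  gap_high <- min(x[y]) - max(x[!y])   # > 0: every TRUE case lies above every FALSE case
  gap <- max(gap_low, gap_high)
  if (gap > 0) {
    "complete"          # a threshold splits the groups perfectly
  } else if (gap == 0) {
    "quasi-complete"    # groups touch only at a shared boundary value
  } else {
    "none"              # the groups overlap
  }
}

y <- c(TRUE, TRUE, FALSE, FALSE)
separation_type(c(0.1, 0.2, 0.5, 0.6), y)   # "complete"
separation_type(c(0.1, 0.3, 0.3, 0.6), y)   # "quasi-complete"
separation_type(c(0.1, 0.4, 0.3, 0.6), y)   # "none"
```

On the real data, separation_type(dat$Dij, dat$analogs) would be expected to return "none", in which case the warning presumably reflects fitted probabilities that are merely very close to 0 or 1 at the extremes of Dij.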

Grant Izmirlian
NCI

On 15 Dec 2008, at 18:03, Gavin Simpson wrote:

> Dear List,
>
> Apologies for this off-topic post but it is R-related in the sense that
> I am trying to understand what R is telling me with the data to hand.
>
> ROC curves have recently been used to determine a dissimilarity
> threshold for identifying whether two samples are from the same "type"
> or not. Given the bashing that ROC curves get whenever anyone asks about
> them on this list (and having implemented the ROC methodology in my
> analogue package) I wanted to try directly modelling the probability
> that two sites are analogues for one another for given dissimilarity
> using glm().
>
> The data I have then are a logical vector ('analogs') indicating whether
> the two sites come from the same vegetation and a vector of the
> dissimilarity between the two sites ('Dij'). These are in a csv file
> currently in my university web space. Each 'row' in this file
> corresponds to a single comparison between 2 sites.
>
> When I analyse these data using glm() I get the familiar "fitted
> probabilities numerically 0 or 1 occurred" warning. The data do not look
> linearly separable when plotted (code for which is below). I have read
> Venables and Ripley's discussion of this in MASS4 and other sources that
> discuss this warning and R (Faraway's Extending the Linear Model with R
> and John Fox's new Applied Regression, Generalized Linear Models, and
> Related Methods, 2nd Ed) as well as some of the literature on Firth's
> bias reduction method. But I am still somewhat unsure what
> (quasi-)separation is and if this is the reason for the warnings in this
> case.
>
> My question then is, is this a separation issue with my data, or is it
> quasi-separation that I have read a bit about whilst researching this
> problem? Or is this something completely different?
>
> Code to reproduce my problem with the actual data is given below. I'd
> appreciate any comments or thoughts on this.
>
> #### Begin code snippet ################################################
>
> ## note data file is ~93Kb in size
> dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/dat.csv"))
> head(dat)
> ## fit model --- produces warning
> mod <- glm(analogs ~ Dij, data = dat, family = binomial)
> ## plot the data
> plot(analogs ~ Dij, data = dat)
> fit.mod <- fitted(mod)
> ord <- with(dat, order(Dij))
> with(dat, lines(Dij[ord], fit.mod[ord], col = "red", lwd = 2))
>
> #### End code snippet ##################################################
>
> Thanks in advance
>
> Gavin
> -- 
> %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
> ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
> Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
> Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
> UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
> %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%