[R] modeling binary response variables

Daniel Malter daniel at umd.edu
Tue Jul 15 03:07:15 CEST 2008


Hi Kevin, do you mean an s-shaped relationship between one of the variables
and your response? If so, your response is strictly constrained to the
interval [0,1], and these limits are not due to truncation or censoring
(i.e. your response variable is truly a proportion).

This sounds like a good application for a binomial model, as a linear model
may give you fitted values outside the interval that you are allowed to
observe, [0,1]. The binomial model with a logit (or probit, or cloglog) link
fixes that issue.
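
To illustrate the point about fitted values, here is a minimal sketch on
simulated data. Everything in it (x, n, succ, p, and the true slope of 1.5)
is made up for the illustration and has nothing to do with Kevin's data.

```r
## Simulated data for illustration only; all names and values are made up.
set.seed(1)
x <- seq(-4, 4, length.out = 50)
n <- rep(100, 50)                         # assumed trials per observation
succ <- rbinom(50, size = n, prob = plogis(1.5 * x))
p <- succ / n                             # observed proportions in [0, 1]

lin <- lm(p ~ x)                          # straight-line fit to the proportions
logit <- glm(cbind(succ, n - succ) ~ x, family = binomial)

range(fitted(lin))     # can extend below 0 and above 1
range(fitted(logit))   # stays strictly between 0 and 1
```

With a steep sigmoid like this one, the straight line overshoots at both
ends, while the logit fit cannot leave the unit interval by construction.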

Since you have a proportion (the probability of success), you have something
between 0 and 1. I suggest you transform it by multiplying the proportion
by, say, 100 (or 1000) and rounding the result to the nearest integer. Say Y
is currently your proportion: do new.Y=round(Y*100). Then you create the
number of observations that make up the counter-probability of your
observation: counter.Y=100-new.Y.

Then you can run the binomial as follows:

reg=glm(cbind(new.Y,counter.Y)~predictors,family=binomial) ##runs the regression
summary(reg) ##shows the summary output of your regression
fitted(reg) ##shows the predicted values given your data matrix and your estimated model
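
For a version you can paste and run, here is a rough sketch of the steps
above on made-up data. Y and x are hypothetical stand-ins for your proportion
and one predictor, and the multiplier 100 is an assumed number of trials. If
you know the actual number of trials behind each proportion, using those
counts instead is usually preferable, because the standard errors depend on
the totals.

```r
## Made-up data for illustration; Y and x are hypothetical names.
set.seed(42)
x <- runif(40, 0, 10)
Y <- plogis(-3 + 0.6 * x + rnorm(40, sd = 0.3))  # proportions in (0, 1)

new.Y <- round(Y * 100)      # pseudo-successes out of an assumed 100 trials
counter.Y <- 100 - new.Y     # pseudo-failures; note 100 - new.Y, not 100 - Y

reg <- glm(cbind(new.Y, counter.Y) ~ x, family = binomial)
coef(reg)                    # intercept and slope on the logit scale
head(fitted(reg))            # fitted proportions, back on the 0-1 scale
```

The two-column response cbind(successes, failures) is how glm takes grouped
binomial data; the two columns must add up to the number of trials in each
row.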

You will want to check
a.) whether you need a binomial at all (if your probabilities actually fall
within a much narrower interval than [0,1], then you may be okay with a
linear model), and
b.) if a binomial is more appropriate, whether your data are overdispersed.
Look at whether the residual deviance in the summary of your model is about
equal to the residual degrees of freedom. If the deviance is much larger,
choose family quasibinomial instead of binomial when fitting the model.
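
A common rule of thumb for overdispersion compares the residual deviance
with the residual degrees of freedom. The sketch below shows one way to do
that on simulated data; succ, fail, x, and the extra noise term are all
assumptions made for the illustration.

```r
## Simulated, deliberately overdispersed data; all names are made up.
set.seed(7)
x <- runif(60, 0, 10)
p <- plogis(-3 + 0.6 * x + rnorm(60, sd = 0.8))  # extra noise on logit scale
succ <- rbinom(60, size = 100, prob = p)
fail <- 100 - succ

reg <- glm(cbind(succ, fail) ~ x, family = binomial)
dispersion <- deviance(reg) / df.residual(reg)
dispersion                    # values well above 1 suggest overdispersion

## Same model, but with the dispersion estimated from the data.
reg.q <- glm(cbind(succ, fail) ~ x, family = quasibinomial)
summary(reg.q)$dispersion     # estimated dispersion parameter
```

The quasibinomial fit gives the same coefficients as the binomial one; what
changes is that the standard errors are scaled by the square root of the
estimated dispersion, so the tests become more conservative when the data
are overdispersed.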

Best,
Daniel



Kevin J Emerson wrote:
> 
> R-devotees,
> 
> I have a question about modeling in the case where the response variable
> is binary.
> 
> I have a case where I have a response variable that is the probability of
> success, and four descriptor variables. The response has a sigmoid
> response with one of the variables. I would like to test for the effect of
> the various descriptor variables on the percentage success of the binary
> trait. I have looked at glm with family = "binomial" but am not sure I
> totally understand its use (and therefore am not sure it is the appropriate
> test) and am looking for two things: (1) is glm with family = 'binomial'
> the right way to do this, and (2) are there any good references on how it
> works. I have posted a plot of a sample of the data I am looking at as well
> as the sample data used to generate the plots.
> 
> Sample Plot: http://www.uoregon.edu/~kemerson/tmp/plot.pdf
> Sample Data: http://www.uoregon.edu/~kemerson/tmp/data.csv
> 
> Response variable is percent.dev (se2.dev are the errors from binomial
> estimates given probability and number of samples).
> 
> Descriptor variables are num.days, ppd, temp, and pop.  
> 
> Any help would be greatly appreciated.
> 
> Cheers,
> Kevin Emerson
> 
> 
> ====================================
> Kevin J. Emerson
> Bradshaw - Holzapfel Lab
> 1210 University of Oregon
> Eugene, OR, 97403
> email: kemerson at uoregon.edu
> web: http://evodevo.uoregon.edu/people/emerson.html
> 
