[R-sig-eco] Logistic regression with 2 categorical predictors

Dixon, Philip M [STAT] pdixon at iastate.edu
Fri Oct 24 16:29:11 CEST 2014


Andrew,

Your problems arise from two issues: 
1) The overall test of equal P[prefer] across all ages and the tests of equal P[prefer] for a specific pair of ages are based on different testing philosophies (a score or LR test for the overall test, a Wald test for the pairwise comparisons).
2) The behaviour of the logit function when an empirical probability is 0 (or 1).
You are absolutely correct to highlight the ages with all "Avoid" responses.

If you just want my suggestion for how to conduct informative pairwise comparisons: do 15 two-by-two Chi-square tests, or 15 runs of glm() with only two ages in each analysis.  Because no information is pooled from ages uninvolved in the null hypothesis, these 2x2 tests are exactly the same as the pairwise hypotheses for the full 6-age data.  I don't know whether there is an R package to automate this; I've always written a pair of for loops and used subset= with the appropriate logical in the glm() call (a rough sketch is below).
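Here is a rough sketch of what I mean, assuming (purely for illustration) a data frame called dat with one row per age and columns Age (a factor with 6 levels), Prefer, and Avoid (counts); adjust the names to whatever your data actually use:

    ages <- levels(dat$Age)
    res <- data.frame()
    for (i in 1:(length(ages) - 1)) {
      for (j in (i + 1):length(ages)) {
        sub <- dat$Age %in% c(ages[i], ages[j])
        ## 2x2 Pearson Chi-square test on the two selected ages
        chi <- chisq.test(as.matrix(dat[sub, c("Prefer", "Avoid")]), correct = FALSE)
        ## equivalently, a glm restricted to those two ages via subset=
        fit <- glm(cbind(Prefer, Avoid) ~ Age, family = binomial,
                   data = dat, subset = sub)
        res <- rbind(res,
                     data.frame(age1 = ages[i], age2 = ages[j],
                                p.chisq = chi$p.value,
                                p.lrt = anova(fit, test = "Chisq")[2, "Pr(>Chi)"]))
      }
    }
    res

Either p-value (the Pearson Chi-square or the 1 df likelihood ratio from the two-age glm) answers the same pairwise question.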

Details (which are a very, very small glimpse of a huge area; Fleiss, Statistical Methods for Rates and Proportions, and the tome by Agresti, Categorical Data Analysis, are extended treatments): it doesn't help that the analysis of this type of data has some non-trivial complications.  Upton, in his classic JRSS discussion paper on the 2x2 contingency table, said something like: the 2x2 contingency table is a seemingly simple problem with a surprisingly large number of complications.

1) Why is a 2x2 contingency table comparison of ages 1 and 2 the same as the pairwise comparison of ages 1 and 2 in an analysis of all 6 ages?  When the data are normally distributed and analyzed by ANOVA, the error variance is (usually) estimated from all 6 groups.  The analysis of just two groups estimates the error variance from fewer observations, so the 6-group pooled error variance provides improved power for the comparison of groups 1 and 2.  The null hypothesis in both cases specifies equality of ages 1 and 2, but all 6 groups provide information about the error variance.  With count data, there is no error variance that has to be estimated.  (I'm not considering overdispersion, which, if present, would be estimated from all 6 groups, in which case the 2x2 and "6x2 followed by a pairwise comparison" analyses are no longer equivalent.)  Again, the null hypothesis is the same in both analyses.  One way to view the specific comparison of ages 1 and 2 in the 6-age analysis is that you are fitting a model with 5 probabilities to the data: one probability shared by ages 1 and 2 (hence their probabilities are equal) and four more probabilities, one each for ages 3 through 6.  The fit of that 5-parameter model is compared to the fit of a 6-parameter model in which each age has its own probability.  Both the 2x2 test and the pairwise comparison are 1 df tests.  In neither case can the comparison of ages 1 and 2 benefit from any information from the other four groups.
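To make the "5 probabilities vs 6 probabilities" idea concrete, here is a sketch (same hypothetical dat as above) that forces ages 1 and 2 to share one probability by collapsing the first two levels of Age, then compares that reduced model to the full 6-probability model:

    dat$Age12 <- dat$Age
    levels(dat$Age12)[1:2] <- "1and2"   # ages 1 and 2 now share one parameter
    full    <- glm(cbind(Prefer, Avoid) ~ Age,   family = binomial, data = dat)
    reduced <- glm(cbind(Prefer, Avoid) ~ Age12, family = binomial, data = dat)
    anova(reduced, full, test = "Chisq")   # the 1 df test of P[age 1] = P[age 2]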

2) What's going on in the anova(model_sg, test="Chisq") analysis?  This is a comparison of two models.  The data and the models can be set up in at least three asymptotically equivalent ways (see Agresti, and probably Fleiss, on "sampling models for the contingency table" for the details).  I'll use the independent binomial model.  That model has up to 6 parameters, i.e. a P[prefer] for each age, where P[prefer] is the probability that a fish in a specified age group has a "Prefer" response rather than an "Avoid" response.  The null hypothesis (reduced) model is that each age has the same probability, so there is one parameter.  The alternative (full) model is that each age has a different probability.  Yes, the alternative model is fully specified.  Yes, the alternative model has 0 "error" df.  When you fit these models using glm, you are using deviance = -2 log likelihood as the measure of fit.  Yes, the deviance is 0 for the full model (under the usual way of writing the log likelihood).  That is exactly how things should be.  The comparison between these two models is a measure of whether the reduced model fits the data.  (Semi-aside: you can write the log-likelihood for the full model so that its deviance is not 0.  If so, the log-likelihoods for both the null and alternative models are shifted by the same additive constant, so the change in deviance is the same for either way of writing the log-likelihood.)  This comparison of models is exactly the same comparison done by the usual (K-1) df Chi-square test of equal proportions in a Kx2 contingency table.  The only difference is that the Chi-square test uses the Pearson Chi-square statistic as the measure of how well a model fits the data; glm uses the deviance.  We regularly calculate the Chi-square statistic for the null hypothesis.  We never explicitly calculate Chi-square for the alternative hypothesis, because Chi-square for the full model is exactly 0 with exactly 0 df.  The change between the two models is the usual Chi-square test statistic.
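With the same hypothetical dat, the two versions of the overall test look like this; they test the same null hypothesis (all 6 probabilities equal) but use different measures of fit, so the statistics are close but not identical:

    ## Pearson Chi-square on the 6 x 2 table (the score-type test)
    chisq.test(as.matrix(dat[, c("Prefer", "Avoid")]))
    ## 5 df change in deviance between the 1-parameter null model and the
    ## full one-probability-per-age model (the LR test that anova() reports)
    fit0 <- glm(cbind(Prefer, Avoid) ~ 1,   family = binomial, data = dat)
    fit1 <- glm(cbind(Prefer, Avoid) ~ Age, family = binomial, data = dat)
    anova(fit0, fit1, test = "Chisq")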

3) Why are counts of 0 a problem for an analysis using logit probabilities?  If you fit glm(cbind(Prefer,Avoid) ~ -1+Age, ...), you fit a 6-parameter model with one parameter for each group, without having to worry about R's full-rank contrast parameterization.  The problems with 0's would be immediately apparent.  The parameters for ages 5 and 6 would be around -25.  That value is a numerical approximation; the estimates should be -infinity, because logit(0) = log[0/(1-0)] = -infinity.  Any value from ca. -10 to -1000000 gives a probability that is essentially 0.  The se's of those estimates are huge, which should no longer be surprising, because any value in a huge range of large negative numbers fits equally well (all transform back to a probability of essentially 0; I don't think there is much practical difference between P[prefer] = 1e-30, 1e-10, or even 0.00001 when the number of individuals is between 5 and 30).
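A toy example (made-up counts, just to illustrate) shows the behaviour:

    toy <- data.frame(Age    = factor(1:3),
                      Prefer = c(8, 5, 0),     # the third age has no "Prefer" responses
                      Avoid  = c(12, 15, 20))
    fit <- glm(cbind(Prefer, Avoid) ~ -1 + Age, family = binomial, data = toy)
    summary(fit)$coefficients
    ## Age3's estimate is a large negative number (a numerical stand-in for
    ## -infinity) with an enormous standard error; Age1 and Age2 look sensible.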

If the estimate for age 5 is poor, the comparison of that estimate to the one for age 1 will also be poor.  Hence the large se's for any comparison involving age 5 or age 6.  The problem is the testing philosophy, not the biological value of the test.  The pairwise comparisons between specific groups use the estimates and their standard errors under the alternative hypothesis (Wald tests).  The overall (5 df) test uses the change in deviance between models (glm likelihood ratio, LR test) or the fit of the data under the null hypothesis (the Chi-square test, a score test).

If you really have to fit a model with probabilities very close to or equal to 0, there are ways to get better-behaved estimates, e.g. using Firth's penalized likelihood or a Bayesian approach.  I don't believe that is necessary here.
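For completeness only, a hedged sketch of one such fit using the brglm2 package (the interface as I recall it; check the package documentation), again on the hypothetical dat:

    ## Firth-type (bias-reduced) binomial fit; estimates stay finite even for
    ## the all-Avoid ages
    library(brglm2)
    firth_fit <- glm(cbind(Prefer, Avoid) ~ Age, family = binomial,
                     data = dat, method = "brglmFit")
    summary(firth_fit)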

As I said earlier, my practical solution would be to use a series of 2x2 contingency table analyses.  You won't be able to compare groups 5 and 6, but you shouldn't want to do any statistical test there.  If you want a multiple-comparisons adjustment, multiply all the usual p-values by 15, a Bonferroni correction for all pairs of 6 groups.  That approach answers your biological question and avoids comparing the age-specific logit-scale estimates.
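If you build the table of pairwise p-values as in the loop sketched above, the Bonferroni adjustment is a one-liner (p.adjust() multiplies by the number of tests and caps the result at 1):

    res$p.bonferroni <- p.adjust(res$p.chisq, method = "bonferroni")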

My apologies for the long post to the group.  It couldn't be short because the issues are multiple, not trivial, and needed to be raised.  I would be happy to have a short further discussion with you, Andrew, if you want, but I suggest that be off the list. 

Philip Dixon


