[R-sig-eco] Logistic regression with 2 categorical predictors

Gavin Simpson ucfagls at gmail.com
Fri Oct 24 16:32:09 CEST 2014


Hi Andrew,

On 24 October 2014 01:41, Andrew Halford <andrew.halford at gmail.com> wrote:

> Dear Gavin,
>
> Firstly let me say that I take offence at your "bogus" comment. Just
> because I, like many others who interact on this list, often struggle
> conceptually with the overwhelming analysis choices that are required in
> our line of work doesn't give you the right to drop snide remarks as you
> see fit.
>

Sorry about that; I meant bogus in the sense of its synonym; spurious. I
overlooked that it also had other synonyms could could easily be
interpreted as you have, as me suggesting something nefarious was being
done. That was *not* my intention I do apologise for that as I have clearly
caused some offence where none was intended.

So perhaps I should have said, the results you show are spurious; don't
both interpreting the model because you are overfitting the data - in fact
you are fitting it perfectly (to within numeric precision anyway).


> My line of query is ALWAYS genuine from my perspective and I don't expect
> you or anyone else to belittle people on the public list!
>
> As it turns out my issues are not resolved.
>
> To recap..
>
> I have run a bunch of choice chamber experiments with larval fish.
> Graphing up the ratio of 0/1 choices produces a plot which shows to my eye,
> evidence of a result for some of the tests, with fish appearing to make
> defined choices in the later age groups for 2 of the tests.
>
> What appears to be happening is that because there are some empty cells in
> the later age x test interactions (the fish only took one option to the
> exclusion of the other) the errors are way out and hence preclude any
> chance of getting a significant result. If I add a single result to any of
> the zero cells to remove the "blank" the analysis actually works more as I
> hoped. However I doubt this is acceptable so I am hoping to get some help
> with producing an effective analysis without having to manipulate the blank
> cells.
>

I'm confused; how would you go about "manipulating" blank cells? If you
have more data use it - you certainly can't fit the full model (or two main
effects and their interaction) with the data you currently have. The model
is saturated in that you've fitted as many coefficients as there are data
points; you've replaced the existing response data with a vector of the
same length containing the estimates of the coefficients from the model. In
a sense, you've just transformed your response through a complex procedure.
Nothing else.

As such, you have no basis for then interpreting the coefficients or doing
pairwise comparisons. The model you are fitting is just too complex.

This can happen in experiments where there is sufficient replication of the
levels of the factors and there combinations. Which seems to be what has
happened here because of those darned stubborn fish.

I hope I've done a better of job of i) not annoying you with poorly chosen
words, and ii) explaining why I think you should stop with the current
model as is it is too complex for your data. You can;t test for an
interaction but you could remove the interaction and just test for main
effects, unless you can get some more data.

G


>
>
> Andrew
>
>
>
>
> On 24 October 2014 04:08, Gavin Simpson <ucfagls at gmail.com> wrote:
>
>> This all looks bogus to me; you've fit the data perfectly by fitting a
>> saturated model - there are no residual degrees of freedom and
>> (effectively) zero residual deviance. Things are clearly amiss because you
>> have huge standard errors. You have 24 data points and fit a model with 23
>> coefficient plus the intercept; you just replaced your data with 24 new
>> data points (the values in the Estimate column of the summary() output)
>>
>> I really wouldn't bother interpreting it any further.
>>
>> HTH
>>
>> G
>>
>> On 21 October 2014 18:21, Andrew Halford <andrew.halford at gmail.com>
>> wrote:
>>
>>> Hi Thierry,
>>>
>>> The multiple comparisons ran just fine but there was a ridiculous amount
>>> of
>>> interaction combinations all of which were non-significant even though
>>> there was a highly significant interaction term. I decided to remove test
>>> as a variable to simplify the analysis and run separate single
>>> explanatory
>>> variable logistic regressions. I have included a result below which is
>>> still producing an outcome I cant explain. Namely, why am I getting such
>>> a
>>> significant result for the ANOVA but when I do the tukey tests nothing is
>>> significant?
>>>
>>> > sg_habitat
>>>   Age Prefer Avoid
>>> 1   1     17    14
>>> 2   2     20    10
>>> 3   3     14     9
>>> 4   4     13    12
>>> 5   5      0    18
>>> 6   6      0     5
>>>
>>> > model_sg <- glm(cbind(Prefer,Avoid) ~ Age, data=sg_habitat,
>>> family=binomial)
>>>
>>> > anova(model_sg, test="Chisq")
>>>
>>> Analysis of Deviance Table
>>>
>>> Model: binomial, link: logit
>>>
>>> Response: cbind(Prefer, Avoid)
>>>
>>> Terms added sequentially (first to last)
>>>
>>>
>>>      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
>>> NULL                     5     36.588
>>> Age   5   36.588         0      0.000 7.243e-07 ***
>>>
>>>
>>> > mc_sg <- glht(model_sg, mcp(Age = "Tukey"))
>>>
>>> > summary(mc_sg)
>>>
>>>          Simultaneous Tests for General Linear Hypotheses
>>>
>>> Multiple Comparisons of Means: Tukey Contrasts
>>>
>>>
>>> Fit: glm(formula = cbind(Prefer, Avoid) ~ Age, family = binomial,
>>>     data = sg_habitat)
>>>
>>> Linear Hypotheses:
>>>              Estimate Std. Error z value Pr(>|z|)
>>> 2 - 1 == 0     0.4990     0.5294   0.943    0.912
>>> 3 - 1 == 0     0.2477     0.5593   0.443    0.997
>>> 4 - 1 == 0    -0.1141     0.5390  -0.212    1.000
>>> 5 - 1 == 0   -25.8473 53178.5362   0.000    1.000
>>> 6 - 1 == 0   -24.7307 57729.9299   0.000    1.000
>>> 3 - 2 == 0    -0.2513     0.5767  -0.436    0.997
>>> 4 - 2 == 0    -0.6131     0.5570  -1.101    0.844
>>> 5 - 2 == 0   -26.3463 53178.5362   0.000    1.000
>>> 6 - 2 == 0   -25.2296 57729.9299   0.000    1.000
>>> 4 - 3 == 0    -0.3618     0.5855  -0.618    0.985
>>> 5 - 3 == 0   -26.0950 53178.5362   0.000    1.000
>>> 6 - 3 == 0   -24.9783 57729.9299   0.000    1.000
>>> 5 - 4 == 0   -25.7332 53178.5362   0.000    1.000
>>> 6 - 4 == 0   -24.6165 57729.9299   0.000    1.000
>>> 6 - 5 == 0     1.1167 78490.1364   0.000    1.000
>>> (Adjusted p values reported -- single-step method)
>>>
>>>
>>> On 21 October 2014 22:53, ONKELINX, Thierry <Thierry.ONKELINX at inbo.be>
>>> wrote:
>>>
>>> > Hi Andrew,
>>> >
>>> > Please keep the mailing list in cc.
>>> >
>>> > The estimates in mc are the differences of the parameter estimates
>>> (betas)
>>> > between both levels. E.g. 5.LR -1.LR = -1.168 or 5.LR = 1.LR - 1.168
>>> >
>>> > summary(mc) should give you the significance of those differences. That
>>> > should work. If it doesn't, please provide more info: at least your
>>> code
>>> > and the error message. A small reproducible example is better.
>>> > confint(mc) gives the confidence intervals of the differences.
>>> >
>>> > Best regards,
>>> >
>>> > Thierry
>>> >
>>> > ir. Thierry Onkelinx
>>> > Instituut voor natuur- en bosonderzoek / Research Institute for Nature
>>> and
>>> > Forest
>>> > team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
>>> > Kliniekstraat 25
>>> > 1070 Anderlecht
>>> > Belgium
>>> > + 32 2 525 02 51
>>> > + 32 54 43 61 85
>>> > Thierry.Onkelinx at inbo.be
>>> > www.inbo.be
>>> >
>>> > To call in the statistician after the experiment is done may be no more
>>> > than asking him to perform a post-mortem examination: he may be able
>>> to say
>>> > what the experiment died of.
>>> > ~ Sir Ronald Aylmer Fisher
>>> >
>>> > The plural of anecdote is not data.
>>> > ~ Roger Brinner
>>> >
>>> > The combination of some data and an aching desire for an answer does
>>> not
>>> > ensure that a reasonable answer can be extracted from a given body of
>>> data.
>>> > ~ John Tukey
>>> >
>>> > Van: Andrew Halford [mailto:andrew.halford at gmail.com]
>>> > Verzonden: dinsdag 21 oktober 2014 16:19
>>> > Aan: ONKELINX, Thierry
>>> > Onderwerp: Re: [R-sig-eco] Logistic regression with 2 categorical
>>> > predictors
>>> >
>>> > Hi Thierry,
>>> > Thanks for the response. I have run your code but it seems you cant
>>> call
>>> > the summary function, you just have to call the object on its own i.e.
>>> mc.
>>> > The results are I get are below but I am not sure how to interpret
>>> these,
>>> > exactly what does the estimate represent? How do I measure significance
>>> > here?
>>> >
>>> >                    Estimate
>>> > 2.LR - 1.LR == 0  1.252e-01
>>> > 3.LR - 1.LR == 0 -5.390e-01
>>> > 4.LR - 1.LR == 0  1.802e-02
>>> > 5.LR - 1.LR == 0 -1.168e+00
>>> > 6.LR - 1.LR == 0 -2.575e+01
>>> > 1.SD - 1.LR == 0  7.411e-02
>>> > 2.SD - 1.LR == 0 -2.408e-01
>>> > 3.SD - 1.LR == 0  2.675e-01
>>> > etc etc
>>> >
>>> > Andy
>>> >
>>> > On 20 October 2014 23:04, ONKELINX, Thierry <Thierry.ONKELINX at inbo.be>
>>> > wrote:
>>> > Dear Andrew,
>>> >
>>> > anova() and summary() test different hypotheses. anova() tests is at
>>> least
>>> > one level is different from the others. summary() tests if the
>>> coefficient
>>> > is different from zero.
>>> >
>>> > Multiple comparison of different interaction levels is probably the
>>> most
>>> > relevant in this case. The easiest way is to make a new variable.
>>> >
>>> > snapper2$inter <- with(snapper2, interaction(age, test))
>>> > model <- glm(cbind(prefer,avoid) ~ 0 + inter, data=snapper2,
>>> > family=binomial)
>>> > library(multcomp)
>>> > mc <- glht(model, mcp(inter = "Tukey))
>>> > summary(mc)
>>> >
>>> > Best regards,
>>> >
>>> > ir. Thierry Onkelinx
>>> > Instituut voor natuur- en bosonderzoek / Research Institute for Nature
>>> and
>>> > Forest
>>> > team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
>>> > Kliniekstraat 25
>>> > 1070 Anderlecht
>>> > Belgium
>>> > + 32 2 525 02 51
>>> > + 32 54 43 61 85
>>> > Thierry.Onkelinx at inbo.be
>>> > www.inbo.be
>>> >
>>> > To call in the statistician after the experiment is done may be no more
>>> > than asking him to perform a post-mortem examination: he may be able
>>> to say
>>> > what the experiment died of.
>>> > ~ Sir Ronald Aylmer Fisher
>>> >
>>> > The plural of anecdote is not data.
>>> > ~ Roger Brinner
>>> >
>>> > The combination of some data and an aching desire for an answer does
>>> not
>>> > ensure that a reasonable answer can be extracted from a given body of
>>> data.
>>> > ~ John Tukey
>>> >
>>> >
>>> > -----Oorspronkelijk bericht-----
>>> > Van: r-sig-ecology-bounces at r-project.org [mailto:
>>> > r-sig-ecology-bounces at r-project.org] Namens Andrew Halford
>>> > Verzonden: maandag 20 oktober 2014 16:06
>>> > Aan: r-sig-ecology at r-project.org
>>> > Onderwerp: [R-sig-eco] Logistic regression with 2 categorical
>>> predictors
>>> >
>>> > Hi Listers,
>>> >
>>> > I am trying to run a logistic regression to look at the effects of
>>> > experiment type and age on the behavior of fish in a choice chamber
>>> > experiment.
>>> >
>>> > I am using the glm approach and would like some advice on how or
>>> whether
>>> > to perform contrasts to work out what levels of Factor1 (Age) and
>>> Factor 2
>>> > (Test) are significantly different from each other. I have not been
>>> able
>>> > to clarify from my reading what is the appropriate approach to take
>>> when
>>> > dealing with a significant interaction term. I am also not sure as to
>>> how
>>> > one interprets a model when all the coefficients are non-significant
>>> but
>>> > the chi-square ANOVA shows a highly significant interaction term.
>>> >
>>> > I have graphed up the data as dot plots and there is definitely
>>> evidence
>>> > of changes in proportions in later ages.
>>> >
>>> > I want to provide evidence for when and for which tests there was a
>>> > 'significant' change in behavior.
>>> >
>>> > > snapper2
>>> >    age test prefer avoid
>>> > 1    1   LR     15    14
>>> > 2    1   SD     15    13
>>> > 3    1   SG     17    14
>>> > 4    1   SW     14    14
>>> > 5    2   LR     17    14
>>> > 6    2   SD     16    19
>>> > 7    2   SG     20    10
>>> > 8    2   SW     15    21
>>> > 9    3   LR     10    16
>>> > 10   3   SD     14    10
>>> > 11   3   SG     14     9
>>> > 12   3   SW     13    15
>>> > 13   4   LR     12    11
>>> > 14   4   SD     14    11
>>> > 15   4   SG     13    12
>>> > 16   4   SW     11    14
>>> > 17   5   LR      4    12
>>> > 18   5   SD      8     8
>>> > 19   5   SG      0    18
>>> > 20   5   SW     10     6
>>> > 21   6   LR      0     6
>>> > 22   6   SD      3     4
>>> > 23   6   SG      0     5
>>> > 24   6   SW      5     3
>>> >
>>> >  >
>>> >
>>> >
>>> dotplot(age~prefer/avoid,data=snapper2,group=snapper2$test,cex=1.5,pch=19,ylab="age",auto.key=list(space="right",title="Tests"))
>>> >
>>> >
>>> > > out2 <- glm(cbind(prefer,avoid) ~ age*test, data=snapper2,
>>> > family=binomial)
>>> >
>>> > > summary(out2)
>>> >
>>> > Call:
>>> > glm(formula = cbind(prefer, avoid) ~ age * test, family = binomial,
>>> >     data = snapper2)
>>> >
>>> > Deviance Residuals:
>>> >  [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
>>> 0  0
>>> > 0
>>> >
>>> > Coefficients:
>>> >               Estimate Std. Error z value Pr(>|z|)
>>> > (Intercept)  6.899e-02  3.716e-01   0.186   0.8527
>>> > age2         1.252e-01  5.180e-01   0.242   0.8091
>>> > age3        -5.390e-01  5.483e-01  -0.983   0.3256
>>> > age4         1.802e-02  5.589e-01   0.032   0.9743
>>> > age5        -1.168e+00  6.866e-01  -1.701   0.0890 .
>>> > age6        -2.575e+01  9.348e+04   0.000   0.9998
>>> > testSD       7.411e-02  5.307e-01   0.140   0.8890
>>> > testSG       1.252e-01  5.180e-01   0.242   0.8091
>>> > testSW      -6.899e-02  5.301e-01  -0.130   0.8964
>>> > age2:testSD -4.401e-01  7.260e-01  -0.606   0.5444
>>> > age3:testSD  7.324e-01  7.846e-01   0.933   0.3506
>>> > age4:testSD  8.004e-02  7.863e-01   0.102   0.9189
>>> > age5:testSD  1.024e+00  9.301e-01   1.102   0.2707
>>> > age6:testSD  2.532e+01  9.348e+04   0.000   0.9998
>>> > age2:testSG  3.738e-01  7.407e-01   0.505   0.6138
>>> > age3:testSG  7.867e-01  7.832e-01   1.004   0.3152
>>> > age4:testSG -1.321e-01  7.764e-01  -0.170   0.8649
>>> > age5:testSG -2.568e+01  8.768e+04   0.000   0.9998
>>> > age6:testSG  2.121e-02  1.334e+05   0.000   1.0000
>>> > age2:testSW -4.616e-01  7.249e-01  -0.637   0.5242
>>> > age3:testSW  3.959e-01  7.662e-01   0.517   0.6054
>>> > age4:testSW -2.592e-01  7.858e-01  -0.330   0.7415
>>> > age5:testSW  1.678e+00  9.386e-01   1.788   0.0737 .
>>> > age6:testSW  2.626e+01  9.348e+04   0.000   0.9998
>>> > ---
>>> > Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>> >
>>> > (Dispersion parameter for binomial family taken to be 1)
>>> >
>>> >     Null deviance: 5.4908e+01  on 23  degrees of freedom Residual
>>> > deviance: 2.6113e-10  on  0  degrees of freedom
>>> > AIC: 122.73
>>> >
>>> > Number of Fisher Scoring iterations: 23
>>> >
>>> >
>>> > > anova(out2, test="Chisq")
>>> >
>>> > Analysis of Deviance Table
>>> >
>>> > Model: binomial, link: logit
>>> >
>>> > Response: cbind(prefer, avoid)
>>> >
>>> > Terms added sequentially (first to last)
>>> >
>>> >
>>> >          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
>>> > NULL                        23     54.908
>>> > age       5   11.235        18     43.673 0.0469115 *
>>> > test      3    1.593        15     42.079 0.6608887
>>> > age:test 15   42.079         0      0.000 0.0002185 ***
>>> > ---
>>> > Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>> >
>>> > cheers
>>> >
>>> > Andy
>>> >         [[alternative HTML version deleted]]
>>> >
>>> > _______________________________________________
>>> > R-sig-ecology mailing list
>>> > R-sig-ecology at r-project.org
>>> > https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
>>> > * * * * * * * * * * * * * D I S C L A I M E R * * * * * * * * * * * * *
>>> > Dit bericht en eventuele bijlagen geven enkel de visie van de schrijver
>>> > weer en binden het INBO onder geen enkel beding, zolang dit bericht
>>> niet
>>> > bevestigd is door een geldig ondertekend document.
>>> > The views expressed in this message and any annex are purely those of
>>> the
>>> > writer and may not be regarded as stating an official position of
>>> INBO, as
>>> > long as the message is not confirmed by a duly signed document.
>>> >
>>> >
>>> >
>>> > --
>>> > Andrew Halford Ph.D
>>> > Research Scientist (Kimberley Marine Parks)|  Adjunct Research
>>> Scientist
>>> > (Curtin University)
>>> > Dept. Parks and Wildlife
>>> > Western Australia
>>> >
>>> > Ph: +61 8 9219 9795
>>> > Mobile: +61 (0) 468 419 473
>>> > * * * * * * * * * * * * * D I S C L A I M E R * * * * * * * * * * * * *
>>> > Dit bericht en eventuele bijlagen geven enkel de visie van de schrijver
>>> > weer en binden het INBO onder geen enkel beding, zolang dit bericht
>>> niet
>>> > bevestigd is door een geldig ondertekend document.
>>> > The views expressed in this message and any annex are purely those of
>>> the
>>> > writer and may not be regarded as stating an official position of
>>> INBO, as
>>> > long as the message is not confirmed by a duly signed document.
>>> >
>>>
>>>
>>>
>>> --
>>> Andrew Halford Ph.D
>>> Research Scientist (Kimberley Marine Parks)|  Adjunct Research Scientist
>>> (Curtin University)
>>> Dept. Parks and Wildlife
>>> Western Australia
>>>
>>> Ph: +61 8 9219 9795
>>> Mobile: +61 (0) 468 419 473
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> R-sig-ecology mailing list
>>> R-sig-ecology at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
>>>
>>
>>
>>
>> --
>> Gavin Simpson, PhD
>>
>
>
>
> --
> Andrew Halford Ph.D
> Research Scientist (Kimberley Marine Parks)|  Adjunct Research Scientist
> (Curtin University)
> Dept. Parks and Wildlife
> Western Australia
>
> Ph: +61 8 9219 9795
> Mobile: +61 (0) 468 419 473
>



-- 
Gavin Simpson, PhD

	[[alternative HTML version deleted]]



More information about the R-sig-ecology mailing list