[R] interpretation of p values for highly correlated logistic analysis
    claus orourke 
    claus.orourke at gmail.com
       
    Wed Mar 31 13:38:17 CEST 2010
    
    
  
Dear list,
I want to perform a logistic regression analysis with multiple
categorical predictors (i.e., a logit model) on some data where there is a
very definite relationship between one predictor and the
response (dependent) variable. The problem I have is that in such a
case the p value goes very high (while I, as a naive newbie, would
expect it to crash towards 0).
I'll illustrate my problem with some toy data. Say I have the
following data as an input frame:
   roman animal colour
1  alpha    dog black
2   beta    cat white
3  alpha    dog black
4  alpha    cat black
5   beta    dog white
6  alpha    cat black
7  gamma    dog white
8  alpha    cat black
9  gamma    dog white
10  beta    cat white
11 alpha    dog black
12 alpha    cat black
13 gamma    dog white
14 alpha    cat black
15  beta    dog white
16  beta    cat black
17 alpha    cat black
18  beta    dog white
In this toy data you can see that roman:alpha and roman:beta are
pretty good predictors of colour.
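For anyone who wants to check this, the toy data can be rebuilt and cross-tabulated. This is only a minimal sketch; the name `toy` is made up for illustration (the frame is called `data` in the commands further down).

```r
# Rebuild the 18-row toy data frame shown above; `toy` is an illustrative name.
toy <- data.frame(
  roman = c("alpha", "beta", "alpha", "alpha", "beta", "alpha",
            "gamma", "alpha", "gamma", "beta", "alpha", "alpha",
            "gamma", "alpha", "beta", "beta", "alpha", "beta"),
  animal = c("dog", "cat", "dog", "cat", "dog", "cat",
             "dog", "cat", "dog", "cat", "dog", "cat",
             "dog", "cat", "dog", "cat", "cat", "dog"),
  colour = c("black", "white", "black", "black", "white", "black",
             "white", "black", "white", "white", "black", "black",
             "white", "black", "white", "black", "black", "white"),
  stringsAsFactors = TRUE
)

# Count how often each colour occurs within each level of roman.
tab <- table(toy$roman, toy$colour)
print(tab)
```

In this table every alpha row is black and every gamma row is white, while beta is white in five rows out of six.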
Let's say I perform logistic analysis directly on the raw data with
colour as a response variable:
> options(contrasts=c("contr.treatment","contr.poly"))
> anal1 <- glm(data$colour~data$roman+data$animal,family=binomial)
then I find that my p values for each individual level coefficient approach 1:
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)       -41.65   19609.49  -0.002    0.998
data$romanbeta     42.35   19609.49   0.002    0.998
data$romangamma    43.74   31089.48   0.001    0.999
data$animaldog     20.48   13866.00   0.001    0.999
while I expect the p value for roman:beta to be quite low because it
is a good predictor of colour:white.
On the other hand, if I then run an anova with a Chi-sq test on the
resulting model, I find, as I would expect, that 'roman' is a good
predictor of colour.
> anova(anal1,test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: data$colour
Terms added sequentially (first to last)
            Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                           17    24.7306
data$roman   2  19.3239        15     5.4067 6.366e-05 ***
data$animal  1   1.5876        14     3.8191    0.2077
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
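The whole analysis can be reproduced from the toy data in one go. The sketch below is an assumption-laden convenience rather than the exact session above: the names `txt`, `toy` and `fit` are made up here, and the model is written with a `data =` argument, so the coefficients are labelled `romanbeta` instead of `data$romanbeta`. R will probably warn that fitted probabilities numerically 0 or 1 occurred.

```r
# Toy data from the post, without the row numbers; `txt`, `toy`, `fit` are illustrative names.
txt <- "roman animal colour
alpha dog black
beta cat white
alpha dog black
alpha cat black
beta dog white
alpha cat black
gamma dog white
alpha cat black
gamma dog white
beta cat white
alpha dog black
alpha cat black
gamma dog white
alpha cat black
beta dog white
beta cat black
alpha cat black
beta dog white"
toy <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)

# Same model as anal1, written with a formula plus data argument.
# With factor levels in alphabetical order, "white" is the modelled outcome.
fit <- glm(colour ~ roman + animal, family = binomial, data = toy)

# Per-coefficient Wald z tests (the table with p values near 1).
print(summary(fit)$coefficients)

# Sequential analysis of deviance (the table where roman is significant).
print(anova(fit, test = "Chisq"))
```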
Can anyone please explain why my p value is so high for the individual levels?
Sorry for what is likely a stupid question.
Claus
p.s., when I run logistic analysis on data that is more 'randomised'
everything comes out as I expect.
    
    