[R] FW: logistic regression

Sat Sep 27 16:22:05 CEST 2008

Darin Brooks wrote:
> Sorry.
> 
> Let me try again then.
> 
> I am trying to find "significant" predictors" from a list of about 44
> independent variables.  So I started with all 44 variables and ran
> drop1(sep22lr, test="Chisq")... and then dropped the highest p value from
> the run.  Then I reran the drop1.  
> 
> Model:
> MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_MST_1 + 
>     SOIL_NUTR + cE + cN + cELEV + cDIAM_125 + cCRCLS + cCULM_125 + 
>     cSPH + cAGE + cVRI_NONPINE + cVRI_nonpineCFR + cVRI_BLEAF + 
>     cvol_125 + cstrDST_SW + cwaterDST_SW + cSEEDSRCE_SW + cMAT + 
>     cMWMT + cMCMT + cTD + cMAP + cMSP + cAHM + cSHM + cMATMAP + 
>     cddless0 + cddless18 + cddgrtr0 + cddgrtr18 + cNFFD + cbFFP + 
>     ceFFP + cPAS + cDD5_100 + cEXT_Cold + cS_INDX
>                 Df Deviance    AIC    LRT   Pr(Chi)    
> <none>               814.21 938.21                     
> ORG_CODE         4   824.97 940.97  10.76 0.0294100 *  
> BECLBL08         9   845.61 951.61  31.41 0.0002519 ***
> PEM_SScat       10   829.11 933.11  14.90 0.1357580    
> SOIL_MST_1       1   814.63 936.63   0.43 0.5135094    
> SOIL_NUTR        2   818.49 938.49   4.28 0.1175411    
> cE               1   814.37 936.37   0.16 0.6886085    
> cN               1   814.40 936.40   0.20 0.6566765    
> cELEV            1   814.35 936.35   0.14 0.7044864    
> cDIAM_125        1   817.98 939.98   3.78 0.0519554 .  
> cCRCLS           1   819.32 941.32   5.11 0.0237598 *  
> cCULM_125        1   816.17 938.17   1.97 0.1606846    
> cSPH             1   816.62 938.62   2.41 0.1204141    
> cAGE             1   815.92 937.92   1.72 0.1902314    
> cVRI_NONPINE     1   818.04 940.04   3.84 0.0501149 .  
> cVRI_nonpineCFR  1   821.17 943.17   6.96 0.0083197 ** 
> cVRI_BLEAF       1   818.78 940.78   4.58 0.0324286 *  
> cvol_125         1   814.67 936.67   0.47 0.4949495    
> cstrDST_SW       1   814.63 936.63   0.42 0.5169757    
> cwaterDST_SW     1   814.75 936.75   0.55 0.4592643    
> cSEEDSRCE_SW     1   817.73 939.73   3.53 0.0604234 .  
> cMAT             1   814.27 936.27   0.06 0.8002333    
> cMWMT            1   814.49 936.49   0.28 0.5942246    
> cMCMT            1   819.39 941.39   5.18 0.0228425 *  
> cTD              1   816.20 938.20   1.99 0.1580332    
> cMAP             1   814.25 936.25   0.04 0.8386626    
> cMSP             1   818.41 940.41   4.20 0.0404411 *  
> cAHM             1   815.66 937.66   1.46 0.2276311    
> cSHM             1   819.95 941.95   5.75 0.0165227 *  
> cMATMAP          1   814.91 936.91   0.71 0.4001878    
> cddless0         1   818.04 940.04   3.83 0.0502153 .  
> cddless18        1   817.81 939.81   3.60 0.0576931 .  
> cddgrtr0         1   816.64 938.64   2.44 0.1184235    
> cddgrtr18        1   815.77 937.77   1.57 0.2104958    
> cNFFD            1   815.38 937.38   1.18 0.2782582    
> cbFFP            1   814.39 936.39   0.18 0.6677481    
> ceFFP            1   820.22 942.22   6.01 0.0141863 *  
> cPAS             1   814.21 936.21   0.01 0.9347654    
> cDD5_100         1   814.79 936.79   0.58 0.4447531    
> cEXT_Cold        1   816.99 938.99   2.78 0.0954512 .  
> cS_INDX          1   815.21 937.21   1.01 0.3157208    
> 
> 
> And then systematically reran the drop1, removing the HIGHEST p value (least
> significant)from each resultant until only significant (0.10) variables
> remained.
> 
> Model:
> MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_NUTR + 
>     cSEEDSRCE_SW + cMSP + ceFFP + cEXT_Cold
>              Df Deviance    AIC    LRT   Pr(Chi)    
> <none>            884.20 946.20                     
> ORG_CODE      4   916.38 970.38  32.18 1.757e-06 ***
> BECLBL08      9   940.66 984.66  56.46 6.418e-09 ***
> PEM_SScat    11   906.20 946.20  22.00 0.0243795 *  
> SOIL_NUTR     2   894.19 952.19   9.99 0.0067557 ** 
> cSEEDSRCE_SW  1   894.41 954.41  10.21 0.0013983 ** 
> cMSP          1   896.97 956.97  12.77 0.0003516 ***
> ceFFP         1   928.50 988.50  44.30 2.812e-11 ***
> cEXT_Cold     1   923.35 983.35  39.15 3.921e-10 ***
> 
> 
> I didn't create any kind of dummy or factor variables for my categorical
> data (at least, not on purpose).
> 
> With a remaining 8 variables, I tried to run a logistic regression (glm)
> against my dependent variable(MIN_Mstocked).  When I do a summary of the
> glm, I am provided with the usual table of estimate, std error, z value, and
> Pr(>|z|)... BUT there are some coefficients missing in the list.  None of
> the categorical data is complete.  Some are missing only one category, while
> others are missing 4 or 5 categories.  
> 
> e.g.
> 
> Coefficients:
>                        Estimate Std. Error z value Pr(>|z|)    
> (Intercept)          -1.324e+02  1.363e+03  -0.097 0.922611    
> ORG_CODE[T.DLA]      -1.504e+01  1.363e+03  -0.011 0.991192    
> ORG_CODE[T.DMO]      -1.494e+01  1.363e+03  -0.011 0.991253    
> ORG_CODE[T.DPG]      -1.766e+01  1.363e+03  -0.013 0.989658    
> ORG_CODE[T.DVA]      -1.841e+01  1.363e+03  -0.014 0.989220    
> BECLBL08[T.SBS dw 2] -6.733e-01  5.903e-01  -1.141 0.254033    
> BECLBL08[T.SBS dw 3] -1.094e+00  5.714e-01  -1.914 0.055586 .  
> BECLBL08[T.SBS mc 2]  1.573e-01  5.004e-01   0.314 0.753211    
> BECLBL08[T.SBS mc 3]  1.402e+00  5.824e-01   2.408 0.016043 *  
> BECLBL08[T.SBS mk 1] -2.388e+00  7.529e-01  -3.172 0.001514 ** 
> BECLBL08[T.SBS mw]   -1.672e+01  1.393e+03  -0.012 0.990425    
> BECLBL08[T.SBS vk]   -1.614e+01  1.243e+03  -0.013 0.989640    
> BECLBL08[T.SBS wk 1] -3.640e+00  8.174e-01  -4.453 8.48e-06 ***
> BECLBL08[T.SBS wk 3] -1.838e+01  1.363e+03  -0.013 0.989240    
> PEM_SScat[T.B]       -1.815e+01  3.956e+03  -0.005 0.996339    
> PEM_SScat[T.C]        1.998e-01  3.925e-01   0.509 0.610792    
> PEM_SScat[T.D]       -2.314e-01  3.215e-01  -0.720 0.471621    
> PEM_SScat[T.E]        5.581e-01  3.433e-01   1.626 0.104020    
> PEM_SScat[T.F]       -1.113e+00  5.782e-01  -1.926 0.054153 .  
> PEM_SScat[T.G]        1.780e-01  4.420e-01   0.403 0.687150    
> PEM_SScat[T.H]        1.670e+01  3.956e+03   0.004 0.996633    
> PEM_SScat[T.I]        2.751e-01  9.313e-01   0.295 0.767705    
> PEM_SScat[T.J]       -2.623e-01  9.693e-01  -0.271 0.786649    
> PEM_SScat[T.K]       -1.862e+01  3.956e+03  -0.005 0.996244    
> PEM_SScat[T.L]       -1.661e+01  1.211e+03  -0.014 0.989056    
> SOIL_NUTR[T.C]       -1.119e+00  3.781e-01  -2.960 0.003073 ** 
> SOIL_NUTR[T.D]       -7.912e-02  9.049e-01  -0.087 0.930320    
> cSEEDSRCE_SW         -1.512e-03  4.930e-04  -3.066 0.002170 ** 
> cMSP                  1.808e-02  5.304e-03   3.409 0.000652 ***
> ceFFP                 2.889e-01  4.662e-02   6.196 5.80e-10 ***
> cEXT_Cold            -1.880e+00  3.330e-01  -5.647 1.63e-08 ***
> 
> There should be a PEM_Sscat[T.A].  It is the most prevalent occurrence in
> this category.
> 
> ORG_CODE is missing more than 6 categories in the list
> 
> SOIL_NUTR should have a [T.B]
> 
> Does that help? 

Yes.  I don't see a problem however.  First, your variables are
"factors" which means there will be one fewer coefficients than
categories.  One level is a reference group which probably explains
PEM_Sscat and SOIL_NUTR each "missing" one coefficient.  For ORG_CODE,
there were 4 DF in the starting model, 4 DF in the final model with 4
coefficients.  So the 6 missing categories appear to have been missing
from the start.

What do you expect for ORG_CODE?  What does say summary(ORG_CODE) give you?

Are you aware of the dangers of stepwise model fitting?  It is a
commonly recurring theme on this list.

Kevin

> -----Original Message-----
> From: Kevin E. Thorpe [mailto:kevin.thorpe at utoronto.ca] 
> Sent: Saturday, September 27, 2008 6:21 AM
> To: Darin Brooks
> Cc: r-help at r-project.org
> Subject: Re: [R] logistic regression
> 
> 
> Darin Brooks wrote:
>> Good afternoon
>>  
>> I have what I hope is a simple logistic regression issue.
>>  
>> I started with 44 independent variables and then used the drop1, 
>> test="chisq" to reduce the list to 8 significant independent variables.
>>  
>> drop1(sep22lr, test="Chisq") and wound up with this model:
>>  
>> Model: MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_NUTR + 
>> cSEEDSRCE_SW + cMSP + ceFFP + cEXT_Cold
>>  
>> 4 of the remaining variables are categorical and 4 are continuous.
>>  
>> However, when I run a glm and then a summary on the glm - some of the 
>> categorical data is missing from the output.
>>  
>> The PEM_SScat is missing only one variable ... the BECLBL08 is missing 
>> several variables ... the ORG_CODE is missing 4 .. and the SOIL_NUTR 
>> is missing 1 variable.
>>  
>> It seems arbitrary to the number of variables missing.  Is there 
>> something wrong with my syntax in calling the logistic model?  Am I not
> understanding
>> the inputs correctly?   
>>  
>> Any help would be appreciated.
>>  
> 
> I'm not sure I fully understand your question.  It sounds like you created
> your own dummy variables for your categorical variables. Did you?  Or did
> you use factor variables for your categorical variables?
> If the latter, then I REALLY don't understand your question.
> 
> Kevin

-- 
Kevin E. Thorpe
Biostatistician/Trialist, Knowledge Translation Program
Assistant Professor, Dalla Lana School of Public Health
University of Toronto
email: kevin.thorpe at utoronto.ca  Tel: 416.864.5776  Fax: 416.864.6057