[R] FW: logistic regression
Darin Brooks
kfmgis at telus.net
Sat Sep 27 15:44:29 CEST 2008
Sorry.
Let me try again then.
I am trying to find "significant" predictors from a list of about 44
independent variables. So I started with all 44 variables and ran
drop1(sep22lr, test="Chisq"), and then dropped the term with the highest
p-value from the run. Then I reran drop1.
Model:
MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_MST_1 +
SOIL_NUTR + cE + cN + cELEV + cDIAM_125 + cCRCLS + cCULM_125 +
cSPH + cAGE + cVRI_NONPINE + cVRI_nonpineCFR + cVRI_BLEAF +
cvol_125 + cstrDST_SW + cwaterDST_SW + cSEEDSRCE_SW + cMAT +
cMWMT + cMCMT + cTD + cMAP + cMSP + cAHM + cSHM + cMATMAP +
cddless0 + cddless18 + cddgrtr0 + cddgrtr18 + cNFFD + cbFFP +
ceFFP + cPAS + cDD5_100 + cEXT_Cold + cS_INDX
                Df Deviance    AIC   LRT   Pr(Chi)
<none>               814.21 938.21
ORG_CODE         4   824.97 940.97 10.76 0.0294100 *
BECLBL08         9   845.61 951.61 31.41 0.0002519 ***
PEM_SScat       10   829.11 933.11 14.90 0.1357580
SOIL_MST_1       1   814.63 936.63  0.43 0.5135094
SOIL_NUTR        2   818.49 938.49  4.28 0.1175411
cE               1   814.37 936.37  0.16 0.6886085
cN               1   814.40 936.40  0.20 0.6566765
cELEV            1   814.35 936.35  0.14 0.7044864
cDIAM_125        1   817.98 939.98  3.78 0.0519554 .
cCRCLS           1   819.32 941.32  5.11 0.0237598 *
cCULM_125        1   816.17 938.17  1.97 0.1606846
cSPH             1   816.62 938.62  2.41 0.1204141
cAGE             1   815.92 937.92  1.72 0.1902314
cVRI_NONPINE     1   818.04 940.04  3.84 0.0501149 .
cVRI_nonpineCFR  1   821.17 943.17  6.96 0.0083197 **
cVRI_BLEAF       1   818.78 940.78  4.58 0.0324286 *
cvol_125         1   814.67 936.67  0.47 0.4949495
cstrDST_SW       1   814.63 936.63  0.42 0.5169757
cwaterDST_SW     1   814.75 936.75  0.55 0.4592643
cSEEDSRCE_SW     1   817.73 939.73  3.53 0.0604234 .
cMAT             1   814.27 936.27  0.06 0.8002333
cMWMT            1   814.49 936.49  0.28 0.5942246
cMCMT            1   819.39 941.39  5.18 0.0228425 *
cTD              1   816.20 938.20  1.99 0.1580332
cMAP             1   814.25 936.25  0.04 0.8386626
cMSP             1   818.41 940.41  4.20 0.0404411 *
cAHM             1   815.66 937.66  1.46 0.2276311
cSHM             1   819.95 941.95  5.75 0.0165227 *
cMATMAP          1   814.91 936.91  0.71 0.4001878
cddless0         1   818.04 940.04  3.83 0.0502153 .
cddless18        1   817.81 939.81  3.60 0.0576931 .
cddgrtr0         1   816.64 938.64  2.44 0.1184235
cddgrtr18        1   815.77 937.77  1.57 0.2104958
cNFFD            1   815.38 937.38  1.18 0.2782582
cbFFP            1   814.39 936.39  0.18 0.6677481
ceFFP            1   820.22 942.22  6.01 0.0141863 *
cPAS             1   814.21 936.21  0.01 0.9347654
cDD5_100         1   814.79 936.79  0.58 0.4447531
cEXT_Cold        1   816.99 938.99  2.78 0.0954512 .
cS_INDX          1   815.21 937.21  1.01 0.3157208
I then systematically reran drop1, each time removing the term with the
HIGHEST p-value (least significant) from the result, until only variables
significant at the 0.10 level remained.
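The repeated drop1 procedure described above could be automated roughly as follows. This is only a sketch under stated assumptions: the data frame `dat` and its columns `x1`, `x2`, `x3`, `g` and `y` are hypothetical stand-ins, since the data behind sep22lr is not shown, and the p-value is read from the last column of the drop1 table because its name differs between R versions.

```r
# Hypothetical stand-in data; the real data behind sep22lr is not shown.
set.seed(1)
n <- 300
dat <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
  g  = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * dat$x1))

fit <- glm(y ~ x1 + x2 + x3 + g, family = binomial, data = dat)

alpha <- 0.10  # retention threshold mentioned in the post
repeat {
  # Stop if no terms are left to test.
  if (length(labels(terms(fit))) == 0) break
  d <- drop1(fit, test = "Chisq")
  p <- d[-1, ncol(d)]              # p-values, skipping the <none> row
  candidates <- rownames(d)[-1]
  # Stop once every remaining term meets the threshold.
  if (max(p) <= alpha) break
  worst <- candidates[which.max(p)]
  # Refit without the least significant term.
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}
formula(fit)
```

Each pass removes a whole term, so a factor with several levels is kept or dropped as a unit, which matches the Df column in the drop1 tables.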
Model:
MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_NUTR +
cSEEDSRCE_SW + cMSP + ceFFP + cEXT_Cold
             Df Deviance    AIC   LRT   Pr(Chi)
<none>            884.20 946.20
ORG_CODE      4   916.38 970.38 32.18 1.757e-06 ***
BECLBL08      9   940.66 984.66 56.46 6.418e-09 ***
PEM_SScat    11   906.20 946.20 22.00 0.0243795 *
SOIL_NUTR     2   894.19 952.19  9.99 0.0067557 **
cSEEDSRCE_SW  1   894.41 954.41 10.21 0.0013983 **
cMSP          1   896.97 956.97 12.77 0.0003516 ***
ceFFP         1   928.50 988.50 44.30 2.812e-11 ***
cEXT_Cold     1   923.35 983.35 39.15 3.921e-10 ***
I didn't create any kind of dummy or factor variables for my categorical
data (at least, not on purpose).
With the remaining 8 variables, I ran a logistic regression (glm)
against my dependent variable (MIN_Mstocked). When I do a summary of the
glm, I get the usual table of estimate, std. error, z value, and
Pr(>|z|), BUT some coefficients are missing from the list. None of
the categorical variables is complete: some are missing only one category,
while others are missing 4 or 5 categories.
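One way to check which categories the fit actually saw is to compare the factor levels stored in the fitted object with the coefficient names. The sketch below uses hypothetical toy data (`toy`, `grp`, `x`, `y` are made-up names, not the columns from this post) and assumes R's default handling of missing values, where incomplete rows are dropped before fitting.

```r
# Hypothetical toy data; names are illustrative stand-ins only.
set.seed(3)
n <- 200
toy <- data.frame(
  grp = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  x   = rnorm(n)
)
toy$y <- rbinom(n, 1, plogis(0.5 * toy$x))
toy$x[c(3, 17)] <- NA            # two incomplete rows

fit <- glm(y ~ grp + x, family = binomial, data = toy)

fit$xlevels                  # factor levels the fit used
nrow(model.frame(fit))       # rows used after NA removal
table(model.frame(fit)$grp)  # observations per level among the rows used
names(coef(fit))             # coefficient rows, for comparison
```

If a category has no rows left in the model frame, or a column was read in with fewer distinct values than expected, it should show up in this comparison.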
e.g.
Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)          -1.324e+02  1.363e+03  -0.097 0.922611
ORG_CODE[T.DLA]      -1.504e+01  1.363e+03  -0.011 0.991192
ORG_CODE[T.DMO]      -1.494e+01  1.363e+03  -0.011 0.991253
ORG_CODE[T.DPG]      -1.766e+01  1.363e+03  -0.013 0.989658
ORG_CODE[T.DVA]      -1.841e+01  1.363e+03  -0.014 0.989220
BECLBL08[T.SBS dw 2] -6.733e-01  5.903e-01  -1.141 0.254033
BECLBL08[T.SBS dw 3] -1.094e+00  5.714e-01  -1.914 0.055586 .
BECLBL08[T.SBS mc 2]  1.573e-01  5.004e-01   0.314 0.753211
BECLBL08[T.SBS mc 3]  1.402e+00  5.824e-01   2.408 0.016043 *
BECLBL08[T.SBS mk 1] -2.388e+00  7.529e-01  -3.172 0.001514 **
BECLBL08[T.SBS mw]   -1.672e+01  1.393e+03  -0.012 0.990425
BECLBL08[T.SBS vk]   -1.614e+01  1.243e+03  -0.013 0.989640
BECLBL08[T.SBS wk 1] -3.640e+00  8.174e-01  -4.453 8.48e-06 ***
BECLBL08[T.SBS wk 3] -1.838e+01  1.363e+03  -0.013 0.989240
PEM_SScat[T.B]       -1.815e+01  3.956e+03  -0.005 0.996339
PEM_SScat[T.C]        1.998e-01  3.925e-01   0.509 0.610792
PEM_SScat[T.D]       -2.314e-01  3.215e-01  -0.720 0.471621
PEM_SScat[T.E]        5.581e-01  3.433e-01   1.626 0.104020
PEM_SScat[T.F]       -1.113e+00  5.782e-01  -1.926 0.054153 .
PEM_SScat[T.G]        1.780e-01  4.420e-01   0.403 0.687150
PEM_SScat[T.H]        1.670e+01  3.956e+03   0.004 0.996633
PEM_SScat[T.I]        2.751e-01  9.313e-01   0.295 0.767705
PEM_SScat[T.J]       -2.623e-01  9.693e-01  -0.271 0.786649
PEM_SScat[T.K]       -1.862e+01  3.956e+03  -0.005 0.996244
PEM_SScat[T.L]       -1.661e+01  1.211e+03  -0.014 0.989056
SOIL_NUTR[T.C]       -1.119e+00  3.781e-01  -2.960 0.003073 **
SOIL_NUTR[T.D]       -7.912e-02  9.049e-01  -0.087 0.930320
cSEEDSRCE_SW         -1.512e-03  4.930e-04  -3.066 0.002170 **
cMSP                  1.808e-02  5.304e-03   3.409 0.000652 ***
ceFFP                 2.889e-01  4.662e-02   6.196 5.80e-10 ***
cEXT_Cold            -1.880e+00  3.330e-01  -5.647 1.63e-08 ***
There should be a PEM_SScat[T.A]; it is the most prevalent category of
this variable.
ORG_CODE is missing more than 6 categories from the list.
SOIL_NUTR should have a [T.B].
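One possible reading of the missing [T.A] and [T.B] rows, offered as an assumption rather than a diagnosis: if these columns are stored as factors and treatment-style contrasts are in use (the [T.x] labels suggest so), the first level of each factor acts as the baseline and gets no coefficient row of its own. The toy example below uses hypothetical data, with PEM_SScat as a stand-in name and default R coefficient labels, to show the effect and how relevel() moves the baseline.

```r
# Hypothetical toy data; PEM_SScat here is a stand-in, not the poster's column.
set.seed(2)
n <- 200
toy <- data.frame(
  PEM_SScat = factor(sample(c("A", "B", "C"), n, replace = TRUE,
                            prob = c(0.6, 0.2, 0.2))),
  x = rnorm(n)
)
toy$y <- rbinom(n, 1, plogis(0.3 * toy$x))

levels(toy$PEM_SScat)   # the first level is the baseline under treatment contrasts

fit <- glm(y ~ PEM_SScat + x, family = binomial, data = toy)
names(coef(fit))        # no row for level "A": it is part of the intercept

# Choosing a different baseline changes which level has no row.
toy$PEM_SScat <- relevel(toy$PEM_SScat, ref = "B")
fit_b <- glm(y ~ PEM_SScat + x, family = binomial, data = toy)
names(coef(fit_b))      # now "A" has a row and "B" does not
```

Under this reading, one absent level per factor is expected; it would not by itself account for a factor that is missing several levels.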
Does that help?
-----Original Message-----
From: Kevin E. Thorpe [mailto:kevin.thorpe at utoronto.ca]
Sent: Saturday, September 27, 2008 6:21 AM
To: Darin Brooks
Cc: r-help at r-project.org
Subject: Re: [R] logistic regression
Darin Brooks wrote:
> Good afternoon
>
> I have what I hope is a simple logistic regression issue.
>
> I started with 44 independent variables and then used the drop1,
> test="chisq" to reduce the list to 8 significant independent variables.
>
> drop1(sep22lr, test="Chisq") and wound up with this model:
>
> Model: MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_NUTR +
> cSEEDSRCE_SW + cMSP + ceFFP + cEXT_Cold
>
> 4 of the remaining variables are categorical and 4 are continuous.
>
> However, when I run a glm and then a summary on the glm - some of the
> categorical data is missing from the output.
>
> The PEM_SScat is missing only one variable ... the BECLBL08 is missing
> several variables ... the ORG_CODE is missing 4 .. and the SOIL_NUTR
> is missing 1 variable.
>
> It seems arbitrary to the number of variables missing. Is there
> something wrong with my syntax in calling the logistic model? Am I not
> understanding the inputs correctly?
>
> Any help would be appreciated.
>
I'm not sure I fully understand your question. It sounds like you created
your own dummy variables for your categorical variables. Did you? Or did
you use factor variables for your categorical variables?
If the latter, then I REALLY don't understand your question.
Kevin
--
Kevin E. Thorpe
Biostatistician/Trialist, Knowledge Translation Program
Assistant Professor, Dalla Lana School of Public Health
University of Toronto
email: kevin.thorpe at utoronto.ca Tel: 416.864.5776 Fax: 416.864.6057