[R] nontabular logistic regression

Fri Oct 13 17:42:45 CEST 2006

On Fri, 2006-10-13 at 09:28 -0500, Jeffrey Stratford wrote:
> Hi.  I'm attempting to fit a logistic/binomial model so I can determine
> the influence of landscape on the probability that a box gets used by a
> bird.  I've looked at a few sources (MASS text, Dalgaard, Fox and
> google) and the examples are almost always based on tabular predictor
> variables.  My data, however are not.  I'm not sure if that is the
> source of the problems or not because the one example that includes a
> continuous predictor looks to be coded exactly the same way.  Looking at
> the output, I get estimates for each case when I should get a single
> estimate for purbank.  Any suggestions?
> 
> Many thanks,
> 
> Jeff
> 
> 
> THE DATA: (200 boxes total, used [0 if unoccupied, 1 occupied], the rest
> are landscape variables).  
> 
> box	use	purbank	purban2	purban1	pgrassk	pgrass2	pgrass1	grassdist	grasspatchk
> 1	1	0.003813435	0.02684564	0.06896552	0.3282487	0.6845638	0.7586207	0	3.73
> 2	1	0.04429451	0.1610738	0.1724138	0.1534174	0.3825503	0.6551724	0	1.023261
> 3	1	0.04458785	0.06040268	0	0.1628043	0.557047	0.7586207	0	0.9605769
> 4	1	0.06072162	0.2080537	0.06896552	0.01936052	0	0	323.1099	0.2284615
> 5	0	0.6080962	0.6979866	0.6896552	0.03168084	0.1275168	0.2413793	30	0.2627027
> 6	1	0.6060428	0.6107383	0.3448276	0.04077442	0.2885906	0.4482759	30	0.2978571
> 7	1	0.3807568	0.4362416	0.6896552	0.06864183	0.03355705	0	94.86833	0.468
> 8	0	0.3649164	0.3154362	0.4137931	0.06277501	0.1275168	0	120	0.4585714
> 
> THE CODE:
> 
> box.use<- read.csv("c:\\eabl\\2004\\use_logistic2.csv", header=TRUE)
> attach(box.use)
> box.use <- na.omit(box.use)
> use <- factor(use, levels=0:1)
> levels(use) <- c("unused", "used")
> glm1 <- glm(use ~ purbank, binomial)
> 
> THE OUTPUT:
> 
> Coefficients:
>                      Estimate Std. Error   z value Pr(>|z|)
> (Intercept)        -4.544e-16  1.414e+00 -3.21e-16    1.000
> purbank0            2.157e+01  2.923e+04     0.001    0.999
> purbank0.001173365  2.157e+01  2.067e+04     0.001    0.999
> purbank0.001466706  2.157e+01  2.923e+04     0.001    0.999
> purbank0.001760047  6.429e-16  2.000e+00  3.21e-16    1.000
> purbank0.002346729  2.157e+01  2.923e+04     0.001    0.999
> purbank0.003813435  2.157e+01  2.923e+04     0.001    0.999
> purbank0.004106776  2.157e+01  2.067e+04     0.001    0.999
> purbank0.004693458  2.157e+01  2.067e+04     0.001    0.999

It appears that the 'purbank' variable is being imported as a factor,
hence the multiple levels indicated in the left hand column. 

Check:

  str(box.use)

right after the read.csv() step and see what it shows.

>From the sample data above, it _should_ be along the lines of:

> str(box.use)
'data.frame':   8 obs. of  10 variables:
 $ box        : int  1 2 3 4 5 6 7 8
 $ use        : int  1 1 1 1 0 1 1 0
 $ purbank    : num  0.00381 0.04429 0.04459 0.06072 0.60810 ...
 $ purban2    : num  0.0268 0.1611 0.0604 0.2081 0.6980 ...
 $ purban1    : num  0.069 0.172 0.000 0.069 0.690 ...
 $ pgrassk    : num  0.3282 0.1534 0.1628 0.0194 0.0317 ...
 $ pgrass2    : num  0.685 0.383 0.557 0.000 0.128 ...
 $ pgrass1    : num  0.759 0.655 0.759 0.000 0.241 ...
 $ grassdist  : num    0   0   0 323  30 ...
 $ grasspatchk: num  3.730 1.023 0.961 0.228 0.263 ...

Hence, you should be able to use:

> model <- glm(use ~ purbank, data = box.use, family = binomial)

> summary(model)

Call:
glm(formula = use ~ purbank, family = binomial, data = box.use)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.61450  -0.03098   0.31935   0.45888   1.39194

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    3.223      2.225   1.448    0.147
purbank       -6.129      4.773  -1.284    0.199

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8.9974  on 7  degrees of freedom
Residual deviance: 6.5741  on 6  degrees of freedom
AIC: 10.574

Number of Fisher Scoring iterations: 5

Note that na.omit() is the default operation for most R models, so is
redundant. Also, I would not attach the data frame, as you can use the
'data' argument in model related functions. This avoids the confusion of
having multiple copies of the source data set and wondering why changes
made can become confusing and problematic.

HTH,

Marc Schwartz