[R] fitting in logistic model

Tue Jun 30 19:41:20 CEST 2009

On Jun 30, 2009, at 10:44 AM, Ted Harding wrote:

>
> On 30-Jun-09 14:52:20, Marc Schwartz wrote:
>> On Jun 30, 2009, at 4:54 AM, Renzi Fabrizio wrote:
>>
>>> I would like to know how R computes the probability of an event
>>> in a logistic model (P(y=1)) from the score s, linear combination
>>> of x and beta.
>>>
>>> I noticed that there are differences (small, less than e-16) between
>>> the fitting values automatically computed in the glm procedure by R,
>>> and the values "manually" computed by me applying the reverse
>>> formula p=e^s/(1+e^s); moreover I noticed  that the minimum value
>>> of the fitting values in my estimation is 2.220446e-16, and there
>>> are many observation with this probability (instead the minimum
>>> value obtained by "manually" estimation is 2.872636e-152).
>>
>> It would be helpful to see at least a subset of the output from your
>> model and your manual computations so that we can at least visually
>> compare the results to see where the differences may be.
>>
>> The model object returned from using glm() will contain both the
>> linear predictors on the link scale (model$linear.predictors) and
>> the fitted values (model$fitted.values). The latter can be accessed
>> using the fitted() extractor function.
>>
>> To use an example, let's create a simple LR model using the infert
>> data set as referenced in ?glm.
>>
>> model1 <- glm(case ~ spontaneous + induced, data = infert,
>>              family = binomial())
>>
>>> model1
>> Call:  glm(formula = case ~ spontaneous + induced,
>             family = binomial(), data = infert)
>>
>> Coefficients:
>> (Intercept)  spontaneous      induced
>>     -1.7079       1.1972       0.4181
>>
>> Degrees of Freedom: 247 Total (i.e. Null);  245 Residual
>> Null Deviance:            316.2
>> Residual Deviance: 279.6      AIC: 285.6
>>
>> # Get the coefficients
>>> coef(model1)
>> (Intercept) spontaneous     induced
>>  -1.7078601   1.1972050   0.4181294
>>
>> # get the linear predictor values
>> # log odds scale for binomial glm
>>> head(model1$linear.predictors)
>>           1           2           3           4           5
>> 6
>>  1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564
>> 0.32560375
>>
>> You can also get the above by using the coefficients and the model
>> matrix for comparison:
>> # the full set of 248
>> # coef(model1) %*% t(model.matrix(model1))
>>> head(as.vector(coef(model1) %*% t(model.matrix(model1))))
>> [1]  1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564
>> 0.32560375
>>
>> # get fitted response values (predicted probs)
>>> head(fitted(model1))
>>         1         2         3         4         5         6
>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>
>> We can also get the fitted values from the linear predictor values
>> by using:
>>
>> LP <- model1$linear.predictors
>>> head(exp(LP) / (1 + exp(LP)))
>>        1         2         3         4         5         6
>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>
>> You can also get the above by using the predict.glm() function with
>> type == "response". The default type of "link" will get you the  
>> linear
>> predictor values as above. predict.glm() would typically be used to
>> generate predictions using the model on new data.
>>
>>> head(predict(model1, type = "response"))
>>         1         2         3         4         5         6
>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>
>> In glm.fit(), which is the workhorse function in glm(), the fitted
>> values returned in the model object are actually computed by using
>> the inverse link function for the family passed to glm():
>>
>>> binomial()$linkinv
>> function (eta)
>> .Call("logit_linkinv", eta, PACKAGE = "stats")
>> <environment: 0x171c8b10>
>>
>> Thus:
>>> head(binomial()$linkinv(model1$linear.predictors))
>>         1         2         3         4         5         6
>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>
>> So those are the same values as we saw above using the other methods.
>> So, all is consistent across the various methods.
>>
>> Perhaps the above provides some insights for you into how R does some
>> of these things and also to point out, as is frequently the case with
>> R, there is more than one way to get the same result.
>>
>> HTH,
>> Marc Schwartz
>
> An interesting and informative reply, Marc; but I think it misses
> the point of Fabrizio's query. I think Fabrizio's point is the
> following:
>
>  set.seed(54321)
>  X <- sort(rnorm(100))
>  a0 <- 1.0 ; b0 <- 0.5
>  Y <- 1*(runif(100) < a0 + b0*X)
>  GLM <- glm(Y~X,family=binomial)
>  C <- GLM$coef
>  C
>  # (Intercept)           X
>  #    2.726966    2.798429
>  a1 <- C[1] ; b1 <- C[2]
>  Fit0 <- GLM$fit
>  S <- a1 + b1*X
>  Fit1 <- exp(S)/(1+exp(S))
>  max(abs(Fit1 - Fit0))
>  # [1] 1.110223e-16
>
> This discrepancy is of course, in magnitude, a typical "rounding
> error". But the fact that it occurred indicates that when glm()
> computed the fitted values it did not do so by using the fitted
> coefficients GLM$coef, then creating the fitted score (linear
> predictor) S (as above), then applyhing to this the "inverse link"
> exp(S)/(1+exp(S)), since doing that would replicate the above
> calculation and should yield identical results.
>
> In fact, a bit more probing to GLM shows that there is already
> a discrepancy at the "score" level:
>
>  S0 <- GLM$linear.predictors
>  max(abs(S0-S))
>  # [1] 8.881784e-16
>
> so S0 has not been calculated by applying GLM$coef to X. Also, if
> we apply the inverse link to S0, then:
>
>  max(abs(Fit0 - exp(S0)/(1+exp(S0))))
>  # [1] 1.110223e-16
>
> which is the same discrepancy as between Fit1 and Fit0.
>
>  max(abs(Fit1 - exp(S0)/(1+exp(S0))))
>  # [1] 1.110223e-16
>
> the same again!!
>
> Hence, if I have understood him aright, Fabrizio's question.
>
> Hoping this helps,
> Ted.

Ted,

What OS and version of R are you using?

I am on OSX 10.5.7 and using 32 bit:

   R version 2.9.1 Patched (2009-06-30 r48877)

The reason that I ask is that using your data and model, I get:

 > max(abs(Fit1 - Fit0))
[1] 0

 >  max(abs(S0-S))
[1] 0

 >  max(abs(Fit0 - exp(S0)/(1+exp(S0))))
[1] 0

Truth be told, I was initially leaning in the direction of a rounding  
error, but first wanted to be sure that there was not some underlying  
methodological error that Fabrizio was encountering. Hence the focus  
of my reply was to present a level of consistency in getting the  
results using multiple methods.

When I saw your post, I started thinking that my example, which did  
not have a numeric (eg. non-integer) covariate, might have been too  
simple and missed introducing an additional source of rounding error,  
but then I could not replicate your findings either...

Thanks,

Marc