[R] predict.glm -> which class does it predict?

Fri Jul 10 19:27:39 CEST 2009

2009/7/10 Peter Dalgaard <p.dalgaard at biostat.ku.dk>:
> Peter Schüffler wrote:
>>
>> Hi,
>>
>> I have a question about logistic regression in R.
>>
>> Suppose I have a small list of proteins P1, P2, P3 that predict a
>> two-class target T, say cancer/noncancer. Lets further say I know that I can
>> build a simple logistic regression model in R
>>
>> model <- glm(T ~ ., data=d.f(Y), family=binomial)   (Y is the dataset of
>> the Proteins).
>>
>> This works fine. T is a factored vector with levels cancer, noncancer.
>> Proteins are numeric.
>>
>> Now, I want to use predict.glm to predict a new data.
>>
>> predict(model, newdata=testsamples, type="response")    (testsamples is a
>> small set of new samples).
>>
>> The result is a vector of the probabilites for each sample in testsamples.
>> But probabilty WHAT for? To belong to the first level in T? To belong to
>> second level in T?
>>
>> Is this fallowing expression
>> factor(predict(model, newdata=testsamples, type="response") >= 0.5)
>> TRUE, when the new sample is classified to Cancer or when it's classified
>> to Noncancer? And why not the other way around?
>
> It's the probability of the 2nd level of a factor response (termed "success"
> in the documentation, even when your modeling the probability of disease or
> death...), just like when interpreting the logistic regression itself.
>
> I find it easiest to sort ut this kind of issue by experimentation in
> simplified situations. E.g.
>
>> x <- sample(c("A","B"),10,replace=TRUE)
>> x
>  [1] "B" "A" "B" "B" "A" "B" "B" "A" "B" "A"
>> table(x)
> x
> A B
> 4 6
>
> (notice that the relative frequency of B is 0.6)
>
>> glm(x~1,binomial)
> Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
> In addition: Warning message:
> In model.matrix.default(mt, mf, contrasts) :
>  variable 'x' converted to a factor
>
> (OK, so it won't go without conversion to factor. This is a good thing.)
>
>> glm(factor(x)~1,binomial)
>
> Call:  glm(formula = factor(x) ~ 1, family = binomial)
>
> Coefficients:
> (Intercept)
>     0.4055
>
> Degrees of Freedom: 9 Total (i.e. Null);  9 Residual
> Null Deviance:      13.46
> Residual Deviance: 13.46        AIC: 15.46
>
> (The intercept is positive, corresponding to log odds for a probability >
> 0.5 ; i.e.,  must be that "B": 0.4055==log(6/4))
>
>> predict(glm(factor(x)~1,binomial))
>        1         2         3         4         5         6         7       8
> 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651
> 0.4054651
>        9        10
> 0.4054651 0.4054651
>> predict(glm(factor(x)~1,binomial),type="response")
>  1   2   3   4   5   6   7   8   9  10
> 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
>
> As for why it's not the other way around, well, if it had been, then you
> could have asked the same question....
>

Or more specifically:

> resp <- factor(c("cancer", "noncancer", "noncancer", "noncancer"))
> mod <- glm(resp ~ 1, family = binomial)
> predict(mod, type = "response")
   1    2    3    4
0.75 0.75 0.75 0.75

and since noncancer occurs 75% of the time in the sample clearly
its predicting the probability of noncancer.