[R] predict.glm -> which class does it predict?
Peter Dalgaard
p.dalgaard at biostat.ku.dk
Fri Jul 10 17:48:31 CEST 2009
Peter Schüffler wrote:
> Hi,
>
> I have a question about logistic regression in R.
>
> Suppose I have a small list of proteins P1, P2, P3 that predict a
> two-class target T, say cancer/noncancer. Lets further say I know that I
> can build a simple logistic regression model in R
>
> model <- glm(T ~ ., data=d.f(Y), family=binomial) (Y is the dataset of
> the Proteins).
>
> This works fine. T is a factored vector with levels cancer, noncancer.
> Proteins are numeric.
>
> Now, I want to use predict.glm to predict a new data.
>
> predict(model, newdata=testsamples, type="response") (testsamples is
> a small set of new samples).
>
> The result is a vector of the probabilites for each sample in
> testsamples. But probabilty WHAT for? To belong to the first level in T?
> To belong to second level in T?
>
> Is this fallowing expression
> factor(predict(model, newdata=testsamples, type="response") >= 0.5)
> TRUE, when the new sample is classified to Cancer or when it's
> classified to Noncancer? And why not the other way around?
It's the probability of the 2nd level of a factor response (termed
"success" in the documentation, even when your modeling the probability
of disease or death...), just like when interpreting the logistic
regression itself.
I find it easiest to sort ut this kind of issue by experimentation in
simplified situations. E.g.
> x <- sample(c("A","B"),10,replace=TRUE)
> x
[1] "B" "A" "B" "B" "A" "B" "B" "A" "B" "A"
> table(x)
x
A B
4 6
(notice that the relative frequency of B is 0.6)
> glm(x~1,binomial)
Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
In addition: Warning message:
In model.matrix.default(mt, mf, contrasts) :
variable 'x' converted to a factor
(OK, so it won't go without conversion to factor. This is a good thing.)
> glm(factor(x)~1,binomial)
Call: glm(formula = factor(x) ~ 1, family = binomial)
Coefficients:
(Intercept)
0.4055
Degrees of Freedom: 9 Total (i.e. Null); 9 Residual
Null Deviance: 13.46
Residual Deviance: 13.46 AIC: 15.46
(The intercept is positive, corresponding to log odds for a probability
> 0.5 ; i.e., must be that "B": 0.4055==log(6/4))
> predict(glm(factor(x)~1,binomial))
1 2 3 4 5 6 7
8
0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651
0.4054651
9 10
0.4054651 0.4054651
> predict(glm(factor(x)~1,binomial),type="response")
1 2 3 4 5 6 7 8 9 10
0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
As for why it's not the other way around, well, if it had been, then you
could have asked the same question....
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907
More information about the R-help
mailing list