[R] Issue with predict() for glm models
Uwe Ligges
ligges at statistik.uni-dortmund.de
Thu Sep 23 15:06:06 CEST 2004
John Fox wrote:
> Dear Uwe,
>
> Unless I've somehow messed this up, as I mentioned yesterday, what you
> suggest doesn't seem to work when the predictor is a matrix. Here's a
> simplified example:
>
>
>>X <- matrix(rnorm(200), 100, 2)
>>y <- (X %*% c(1,2) + rnorm(100)) > 0
>>dat <- data.frame(y=y, X=X)
>>mod <- glm(y ~ X, family=binomial, data=dat)
>>new <- data.frame(X = matrix(rnorm(20),2))
>>predict(mod, new)
Dear John,
the questioner had a 2 column matrix with 40 and one with 50
observations (not a 100 column matrix with 2 observation) and for those
matrices it works ...
Best,
Uwe
> 1 2 3 4 5
> 6
> 1.81224443 -5.92955128 1.98718051 -10.05331521 2.65065555
> -2.50635812
> 7 8 9 10 11
> 12
> 5.63728698 -0.94845276 -3.61657377 -1.63141320 5.03417372
> 1.80400271
> 13 14 15 16 17
> 18
> 9.32876273 -5.32723406 5.29373023 -3.90822713 -10.95065186
> 4.90038016
>
> . . .
>
> 97 98 99 100
> -6.92509812 0.59357486 -1.17205723 0.04209578
>
>
> Note that there are 100 rather than 10 predicted values.
>
> But with individuals predictors (rather than a matrix),
>
>
>>x1 <- X[,1]
>>x2 <- X[,2]
>>dat.2 <- data.frame(y=y, x1=x1, x2=x2)
>>mod.2 <- glm(y ~ x1 + x2, family=binomial, data=dat.2)
>>new.2 <- data.frame(x1=rnorm(10), x2=rnorm(10))
>>predict(mod.2, new.2)
>
> 1 2 3 4 5 6 7
>
> 6.5723823 0.6356392 4.0291018 -4.7914650 2.1435485 -3.1738096 -2.8261585
>
> 8 9 10
> -1.5255329 -4.7087592 4.0619290
>
> works as expected (?).
>
> Regards,
> John
>
>
>
>>-----Original Message-----
>>From: r-help-bounces at stat.math.ethz.ch
>>[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Uwe Ligges
>>Sent: Thursday, September 23, 2004 1:33 AM
>>To: jrausch at nd.edu
>>Cc: r-help at stat.math.ethz.ch
>>Subject: Re: [R] Issue with predict() for glm models
>>
>>jrausch at nd.edu wrote:
>>
>>
>>>Hello everyone,
>>>
>>>I am having a problem using the predict (or the
>>
>>predict.glm) function in R.
>>
>>>Basically, I run the glm model on a "training" data set and try to
>>>obtain predictions for a set of new predictors from a
>>
>>"test" data set
>>
>>>(i.e., not the predictors that were utilized to obtain the
>>
>>glm parameter estimates).
>>
>>>Unfortunately, every time that I attempt this, I obtain the
>>>predictions for the predictors that were used to fit the
>>
>>glm model. I
>>
>>>have looked at the R mailing list archives and don't believe I am
>>>making the same mistakes that have been made in the past
>>
>>and also have
>>
>>>tried to closely follow the predict.glm example in the help
>>
>>file. Here is an example of what I am trying to do:
>>
>>>########################################################
>>>set.seed(545345)
>>>
>>>################
>>># Necessary Variables #
>>>################
>>>
>>>p <- 2
>>>train.n <- 20
>>>test.n <- 25
>>>mean.vec.1 <- c(1,1)
>>>mean.vec.2 <- c(0,0)
>>>
>>>Sigma.1 <- matrix(c(1,.5,.5,1),p,p)
>>>Sigma.2 <- matrix(c(1,.5,.5,1),p,p)
>>>
>>>###############
>>># Load MASS Library #
>>>###############
>>>
>>>library(MASS)
>>>
>>>###################################
>>># Data to Parameters for Logistic Regression Model #
>>>###################################
>>>
>>>train.data.1 <- mvrnorm(train.n,mu=mean.vec.1,Sigma=Sigma.1)
>>>train.data.2 <- mvrnorm(train.n,mu=mean.vec.2,Sigma=Sigma.2)
>>>train.class.var <- as.factor(c(rep(1,train.n),rep(2,train.n)))
>>>predictors.train <- rbind(train.data.1,train.data.2)
>>>
>>>##############################################
>>># Test Data Where Predictions for Probabilities Using
>>
>>Logistic Reg. #
>>
>>># From Training Data are of Interest
>>
>> #
>>
>>>##############################################
>>>
>>>test.data.1 <- mvrnorm(test.n,mu=mean.vec.1,Sigma=Sigma.1)
>>>test.data.2 <- mvrnorm(test.n,mu=mean.vec.2,Sigma=Sigma.2)
>>>predictors.test <- rbind(test.data.1,test.data.2)
>>>
>>>##############################
>>># Run Logistic Regression on Training Data #
>>>##############################
>>>
>>>log.reg <- glm(train.class.var~predictors.train,
>>>family=binomial(link="logit"))
>>
>>Well, you haven't specified the "data" argument, but given
>>the two variables directly. Exactly those variables will be
>>used in the
>>predict() step below! If you want the predict() step to work,
>>use something like:
>>
>> train <- data.frame(class = train.class.var,
>> predictors = predictors.train)
>> log.reg <- glm(class ~ ., data = train,
>> family=binomial(link="logit"))
>>
>>
>>
>>
>>>log.reg
>>>
>>>#> log.reg
>>>
>>>#Call: glm(formula = train.class.var ~ predictors.train, family =
>>>#binomial(link = "logit")) #
>>>#Coefficients:
>>># (Intercept) predictors.train1 predictors.train2
>>># 0.5105 -0.2945 -1.0811
>>>#
>>>#Degrees of Freedom: 39 Total (i.e. Null); 37 Residual
>>>#Null Deviance: 55.45
>>>#Residual Deviance: 41.67 AIC: 47.67
>>>
>>>###########################
>>># Predicted Probabilities for Test Data #
>>
>>###########################
>>
>>>New.Data <- data.frame(predictors.train1=predictors.test[,1],
>>>predictors.train2=predictors.test[,2])
>>>
>>>logreg.pred.prob.test <-
>>
>>predict.glm(log.reg,New.Data,type="response")
>>
>>>logreg.pred.prob.test
>>
>>Instead, use:
>>
>> test <- data.frame(predictors = predictors.test)
>> predict(log.reg, newdata = test, type="response")
>>
>>
>>note also: please call the generic predict() rather than its
>>glm method.
>>
>>
>>Uwe Ligges
>>
>>
>>
>>>#logreg.pred.prob.test
>>># [1] 0.51106406 0.15597423 0.04948404 0.03863875 0.35587589
>>>0.71331091 # [7] 0.17320087 0.14176632 0.30966718 0.61878952
>>>0.12525988 0.21271139 #[13] 0.70068113 0.18340723 0.10295501
>>>0.44591568 0.72285161 0.31499339 #[19] 0.65789420 0.42750139
>>>0.14435889 0.93008117 0.70798465 0.80109005 #[25] 0.89161472
>>>0.47480625 0.56520952 0.63981834 0.57595189 0.60075882 #[31]
>>>0.96493393 0.77015507 0.87643986 0.62973986 0.63043351 0.45398955
>>>#[37] 0.80855782 0.90835588 0.54809117 0.11568637
>>>########################################################
>>>
>>>Of course, notice that the vector for the predicted
>>
>>probabilities has
>>
>>>only 40 elements, while the "New.Data" has 50 elements
>>
>>(since n.test
>>
>>>has 25 per group for 2 groups) and thus should have 50 predicted
>>>probabilities. As it turns out, the output is for the training data
>>>predictors and not for the "New.Data" as I would like it to be. I
>>>should also note that I have made sure that the names for the
>>>predictors in the "New.Data" are the same as the names for the
>>>predictors within the glm object (i.e., within "log.reg")
>>
>>as this is what is done in the example for predict.glm()
>>within the help files.
>>
>>>Could some one help me understand either what I am doing
>>
>>incorrectly
>>
>>>or what problems there might be within the predict() function? I
>>>should mention that I tried the same program using
>>
>>predict.glm() and
>>
>>>obtained the same problematic results.
>>>
>>>Thanks and take care,
>>>
>>>Joe
>>>
>>>
>>>Joe Rausch, M.A.
>>>Psychology Liaison
>>>Lab for Social Research
>>>917 Flanner Hall
>>>University of Notre Dame
>>>Notre Dame, IN 46556
>>>(574) 631-3910
>>>
>>>"If we knew what it was we were doing, it would not be called
>>>research, would it?"
>>>- Albert Einstein
>>>
>>>______________________________________________
>>>R-help at stat.math.ethz.ch mailing list
>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>PLEASE do read the posting guide!
>>>http://www.R-project.org/posting-guide.html
>>
>>______________________________________________
>>R-help at stat.math.ethz.ch mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide!
>>http://www.R-project.org/posting-guide.html
More information about the R-help
mailing list