[R] Issue with predict() for glm models

Thu Sep 23 08:32:32 CEST 2004

jrausch at nd.edu wrote:

> Hello everyone, 
> 
> I am having a problem using the predict (or the predict.glm) function in R.
> Basically, I run the glm model on a "training" data set and try to obtain
> predictions for a set of new predictors from a "test" data set (i.e., not the
> predictors that were utilized to obtain the glm parameter estimates).
> Unfortunately, every time that I attempt this, I obtain the predictions for the
> predictors that were used to fit the glm model. I have looked at the R mailing
> list archives and don't believe I am making the same mistakes that have been
> made in the past and also have tried to closely follow the predict.glm example
> in the help file. Here is an example of what I am trying to do: 
> 
> ########################################################
> set.seed(545345)
> 
> ################
> # Necessary Variables # 
> ################
> 
> p <- 2
> train.n <- 20
> test.n <- 25 
> mean.vec.1 <- c(1,1)
> mean.vec.2 <- c(0,0)
> 
> Sigma.1 <- matrix(c(1,.5,.5,1),p,p)
> Sigma.2 <- matrix(c(1,.5,.5,1),p,p)
> 
> ###############
> # Load MASS Library #
> ###############
> 
> library(MASS)
> 
> ###################################
> # Data to Parameters for Logistic Regression Model #
> ###################################
> 
> train.data.1 <- mvrnorm(train.n,mu=mean.vec.1,Sigma=Sigma.1)
> train.data.2 <- mvrnorm(train.n,mu=mean.vec.2,Sigma=Sigma.2)
> train.class.var <- as.factor(c(rep(1,train.n),rep(2,train.n)))
> predictors.train <- rbind(train.data.1,train.data.2)
> 
> ##############################################
> # Test Data Where Predictions for Probabilities Using Logistic Reg.  #
> # From Training Data are of Interest                                          #
> ############################################## 
> 
> test.data.1 <- mvrnorm(test.n,mu=mean.vec.1,Sigma=Sigma.1)
> test.data.2 <- mvrnorm(test.n,mu=mean.vec.2,Sigma=Sigma.2)
> predictors.test <- rbind(test.data.1,test.data.2)
> 
> ##############################
> # Run Logistic Regression on Training Data #
> ##############################
> 
> log.reg <- glm(train.class.var~predictors.train,
> family=binomial(link="logit"))

Well, you haven't specified the "data" argument, but given the two 
variables directly. Exactly those variables will be used in the 
predict() step below! If you want the predict() step to work, use 
something like:

   train <- data.frame(class = train.class.var,
                       predictors = predictors.train)
   log.reg <- glm(class ~ ., data = train,
                  family=binomial(link="logit"))

> log.reg
> 
> #> log.reg
> 
> #Call:  glm(formula = train.class.var ~ predictors.train, family =
> #binomial(link = "logit")) 
> #
> #Coefficients:
> #      (Intercept)  predictors.train1  predictors.train2  
> #           0.5105            -0.2945            -1.0811  
> #
> #Degrees of Freedom: 39 Total (i.e. Null);  37 Residual
> #Null Deviance:      55.45 
> #Residual Deviance: 41.67        AIC: 47.67 
> 
> ###########################
> # Predicted Probabilities for Test Data #
> ###########################
> 
> New.Data <- data.frame(predictors.train1=predictors.test[,1],
> predictors.train2=predictors.test[,2])
> 
> logreg.pred.prob.test <- predict.glm(log.reg,New.Data,type="response")
> logreg.pred.prob.test

Instead, use:

   test <- data.frame(predictors = predictors.test)
   predict(log.reg, newdata = test, type="response")

note also: please call the generic predict() rather than its glm method.

Uwe Ligges

> #logreg.pred.prob.test
> # [1] 0.51106406 0.15597423 0.04948404 0.03863875 0.35587589 0.71331091
> # [7] 0.17320087 0.14176632 0.30966718 0.61878952 0.12525988 0.21271139
> #[13] 0.70068113 0.18340723 0.10295501 0.44591568 0.72285161 0.31499339
> #[19] 0.65789420 0.42750139 0.14435889 0.93008117 0.70798465 0.80109005
> #[25] 0.89161472 0.47480625 0.56520952 0.63981834 0.57595189 0.60075882
> #[31] 0.96493393 0.77015507 0.87643986 0.62973986 0.63043351 0.45398955
> #[37] 0.80855782 0.90835588 0.54809117 0.11568637
> ########################################################
> 
> Of course, notice that the vector for the predicted probabilities has only 40
> elements, while the "New.Data" has 50 elements (since n.test has 25 per group
> for 2 groups) and thus should have 50 predicted probabilities. As it turns out,
> the output is for the training data predictors and not for the "New.Data" as I
> would like it to be. I should also note that I have made sure that the names
> for the predictors in the "New.Data" are the same as the names for the
> predictors within the glm object (i.e., within "log.reg") as this is what is
> done in the example for predict.glm() within the help files. 
> 
> Could some one help me understand either what I am doing incorrectly or what
> problems there might be within the predict() function? I should mention that I
> tried the same program using predict.glm() and obtained the same problematic
> results. 
> 
> Thanks and take care, 
> 
> Joe 
> 
> 
> Joe Rausch, M.A. 
> Psychology Liaison 
> Lab for Social Research 
> 917 Flanner Hall 
> University of Notre Dame 
> Notre Dame, IN 46556
> (574) 631-3910
> 
> "If we knew what it was we were doing, it would not be called research, would
> it?"
> - Albert Einstein
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html