[R] Issue with predict() for glm models

John Fox jfox at mcmaster.ca
Wed Sep 22 22:21:26 CEST 2004


Dear Mark and Joe,

Actually, the problem here appears to be caused by the use of a matrix
on the RHS of the model formula. I'm not sure why this doesn't work (I
must be missing something -- perhaps someone else can say what), but
Joe can get the output he expects by specifying the columns of his
matrix as individual predictors in the model formula. BTW, it's better
form to call the generic predict() rather than the method predict.glm()
directly, though the latter will work here.

Editing the original input:

> x1 <- predictors.train[,1]
> x2 <- predictors.train[,2]
> 
> log.reg <- glm(train.class.var ~ x1 + x2,
+         family=binomial(link="logit"))
> log.reg

Call:  glm(formula = train.class.var ~ x1 + x2, family = binomial(link = "logit"))

Coefficients:
(Intercept)           x1           x2  
     0.5102      -0.6118      -0.3192  

Degrees of Freedom: 39 Total (i.e. Null);  37 Residual
Null Deviance:      55.45 
Residual Deviance: 46.49        AIC: 52.49 
> New.Data <- data.frame(x1=predictors.test[,1],
+                        x2=predictors.test[,2])
> 
> logreg.pred.prob.test <- predict(log.reg,New.Data, type="response")
> logreg.pred.prob.test
 [1] 0.2160246 0.2706139 0.3536572 0.6206490 0.5218391 0.2363767 0.1072153
 [8] 0.6405459 0.4436666 0.6680043 0.3377492 0.5892127 0.3230353 0.7540425
[15] 0.2889855 0.5163141 0.6187335 0.1447511 0.5066670 0.4424428 0.4141701
[22] 0.3947212 0.4065674 0.6226195 0.5053101 0.4311552 0.4261810 0.4784102
[29] 0.5126050 0.6756437 0.6147516 0.7659146 0.5219031 0.3938457 0.6495470
[36] 0.5178400 0.8185613 0.7167129 0.5414552 0.8687371 0.5415976 0.8048741
[43] 0.7796451 0.5565636 0.6058371 0.7053130 0.1521769 0.7120320 0.4073465
[50] 0.6801101
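A likely explanation for the original symptom, for what it's worth: when the formula contains a matrix term, model.frame() looks in newdata for a single variable with the matrix's name, and any variable it cannot find there is silently taken from the environment of the formula. Joe's New.Data only had columns named predictors.train1 and predictors.train2, so predict() fell back to the training matrix itself, hence the 40 fitted values. If that reading is right, one can also keep the matrix term and simply hand predict() the test matrix under the same name. A sketch (untested here, and relying on that lookup behaviour):

```r
## Fit with the matrix term, as in the original post:
log.reg <- glm(train.class.var ~ predictors.train,
               family = binomial(link = "logit"))

## Supply the *matrix* under the name used in the formula, so that
## model.frame() finds it in newdata rather than in the environment:
logreg.pred.prob.test <-
    predict(log.reg,
            newdata = list(predictors.train = predictors.test),
            type = "response")

length(logreg.pred.prob.test)  # 50, one probability per test row
```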


I hope this helps,
 John



On Wed, 22 Sep 2004 15:17:23 -0300
 "Fowler, Mark" <FowlerM at mar.dfo-mpo.gc.ca> wrote:
> Perhaps your approach reflects a method of producing a prediction
> dataframe that is just unfamiliar to me, but it looks to me like you
> have created two predictor variables based on the names of the levels
> of the original predictor (predictors.train1, predictors.train2). I
> don't know how the glm function would know that predictors.train1 and
> predictors.train2 are two subs for predictors.train. Maybe try just
> using one prediction variable, and give it the original variable name
> (predictors.train). If this works, just repeat for your second set of
> values.
> 
> >	Mark Fowler
> >	Marine Fish Division
> >	Bedford Inst of Oceanography
> >	Dept Fisheries & Oceans
> >	Dartmouth NS Canada
> >	fowlerm at mar.dfo-mpo.gc.ca
> >
> 
> 
> -----Original Message-----
> From: jrausch at nd.edu [mailto:jrausch at nd.edu] 
> Sent: September 22, 2004 2:53 PM
> To: r-help at stat.math.ethz.ch
> Subject: [R] Issue with predict() for glm models
> 
> 
> 
> Hello everyone, 
> 
> I am having a problem using the predict (or the predict.glm) function
> in R. Basically, I run the glm model on a "training" data set and try
> to obtain predictions for a set of new predictors from a "test" data
> set (i.e., not the predictors that were utilized to obtain the glm
> parameter estimates). Unfortunately, every time that I attempt this, I
> obtain the predictions for the predictors that were used to fit the
> glm model. I have looked at the R mailing list archives and don't
> believe I am making the same mistakes that have been made in the past,
> and I have also tried to closely follow the predict.glm example in the
> help file. Here is an example of what I am trying to do:
> 
> ########################################################
> set.seed(545345)
> 
> ################
> # Necessary Variables # 
> ################
> 
> p <- 2
> train.n <- 20
> test.n <- 25 
> mean.vec.1 <- c(1,1)
> mean.vec.2 <- c(0,0)
> 
> Sigma.1 <- matrix(c(1,.5,.5,1),p,p)
> Sigma.2 <- matrix(c(1,.5,.5,1),p,p)
> 
> ###############
> # Load MASS Library #
> ###############
> 
> library(MASS)
> 
> ###################################
> # Data to Parameters for Logistic Regression Model #
> ###################################
> 
> train.data.1 <- mvrnorm(train.n,mu=mean.vec.1,Sigma=Sigma.1)
> train.data.2 <- mvrnorm(train.n,mu=mean.vec.2,Sigma=Sigma.2)
> train.class.var <- as.factor(c(rep(1,train.n),rep(2,train.n)))
> predictors.train <- rbind(train.data.1,train.data.2)
> 
> ##############################################
> # Test Data Where Predictions for Probabilities Using Logistic Reg. #
> # From Training Data are of Interest                                #
> ##############################################
> 
> test.data.1 <- mvrnorm(test.n,mu=mean.vec.1,Sigma=Sigma.1)
> test.data.2 <- mvrnorm(test.n,mu=mean.vec.2,Sigma=Sigma.2)
> predictors.test <- rbind(test.data.1,test.data.2)
> 
> ##############################
> # Run Logistic Regression on Training Data #
> ##############################
> 
> log.reg <- glm(train.class.var~predictors.train,
> family=binomial(link="logit"))
> log.reg
> 
> #> log.reg
> 
> #Call:  glm(formula = train.class.var ~ predictors.train, family = binomial(link = "logit"))
> #
> #Coefficients:
> #      (Intercept)  predictors.train1  predictors.train2  
> #           0.5105            -0.2945            -1.0811  
> #
> #Degrees of Freedom: 39 Total (i.e. Null);  37 Residual
> #Null Deviance:      55.45 
> #Residual Deviance: 41.67        AIC: 47.67 
> 
> ###########################
> # Predicted Probabilities for Test Data #
> ###########################
> 
> New.Data <- data.frame(predictors.train1=predictors.test[,1],
> predictors.train2=predictors.test[,2])
> 
> logreg.pred.prob.test <- predict.glm(log.reg,New.Data,type="response")
> logreg.pred.prob.test
> 
> #logreg.pred.prob.test
> # [1] 0.51106406 0.15597423 0.04948404 0.03863875 0.35587589 0.71331091
> # [7] 0.17320087 0.14176632 0.30966718 0.61878952 0.12525988 0.21271139
> #[13] 0.70068113 0.18340723 0.10295501 0.44591568 0.72285161 0.31499339
> #[19] 0.65789420 0.42750139 0.14435889 0.93008117 0.70798465 0.80109005
> #[25] 0.89161472 0.47480625 0.56520952 0.63981834 0.57595189 0.60075882
> #[31] 0.96493393 0.77015507 0.87643986 0.62973986 0.63043351 0.45398955
> #[37] 0.80855782 0.90835588 0.54809117 0.11568637
> ########################################################
> 
> Of course, notice that the vector of predicted probabilities has only
> 40 elements, while "New.Data" has 50 rows (since test.n is 25 per
> group for 2 groups) and thus there should be 50 predicted
> probabilities. As it turns out, the output is for the training data
> predictors and not for the "New.Data" as I would like it to be. I
> should also note that I have made sure that the names of the
> predictors in "New.Data" are the same as the names of the predictors
> within the glm object (i.e., within "log.reg"), as this is what is
> done in the example for predict.glm() within the help files.
> 
> 
> Could someone help me understand either what I am doing incorrectly
> or what problems there might be within the predict() function? I
> should mention that I tried the same program using predict.glm() and
> obtained the same problematic results.
> 
> Thanks and take care, 
> 
> Joe 
> 
> 
> Joe Rausch, M.A. 
> Psychology Liaison 
> Lab for Social Research 
> 917 Flanner Hall 
> University of Notre Dame 
> Notre Dame, IN 46556
> (574) 631-3910
> 
> "If we knew what it was we were doing, it would not be called
> research,
> would it?"
> - Albert Einstein
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
> 

--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
http://socserv.mcmaster.ca/jfox/



