[R] Issue with predict() for glm models

Thu Sep 23 16:22:34 CEST 2004

Dear Uwe, 

> -----Original Message-----
> From: Uwe Ligges [mailto:ligges at statistik.uni-dortmund.de] 
> Sent: Thursday, September 23, 2004 8:06 AM
> To: John Fox
> Cc: jrausch at nd.edu; r-help at stat.math.ethz.ch
> Subject: Re: [R] Issue with predict() for glm models
> 
> John Fox wrote:
> 
> > Dear Uwe,
> > 
> > Unless I've somehow messed this up, as I mentioned 
> yesterday, what you 
> > suggest doesn't seem to work when the predictor is a 
> matrix. Here's a 
> > simplified example:
> > 
> > 
> >>X <- matrix(rnorm(200), 100, 2)
> >>y <- (X %*% c(1,2) + rnorm(100)) > 0
> >>dat <- data.frame(y=y, X=X)
> >>mod <- glm(y ~ X, family=binomial, data=dat) new <- data.frame(X = 
> >>matrix(rnorm(20),2)) predict(mod, new)
> 
> Dear John,
> 
> the questioner had a 2 column matrix with 40 and one with 50 
> observations (not a 100 column matrix with 2 observation) and 
> for those matrices it works ...
> 

Indeed, and in my example the matrix predictor X has 2 columns and 100 rows;
I did screw up the matrix for the "new" data to be used for predictions (in
the example I sent today but not yesterday), but even when this is done
right -- where the new data has 10 rows and 2 columns -- there are 100 (not
10) predicted values:

> X <- matrix(rnorm(200), 100, 2)  # original predictor matrix with 100 rows
> y <- (X %*% c(1,2) + rnorm(100)) > 0
> dat <- data.frame(y=y, X=X)
> mod <- glm(y ~ X, family=binomial, data=dat)
> new <- data.frame(X = matrix(rnorm(20),10, 2)) # corrected -- note 10 rows
> predict(mod, new) # note 100 predicted values
           1            2            3            4            5
6 
  5.75238091   0.31874587  -3.00515893  -3.77282121  -1.97511221
0.54712914 
           7            8            9           10           11
12 
  1.85091226   4.38465524  -0.41028694  -1.53942869   0.57613555
-1.82761518 

 . . .

          91           92           93           94           95
96 
  0.36210780   1.71358713  -9.63612775  -4.54257576  -5.29740468
2.64363405 
          97           98           99          100 
 -4.45478627  -2.44973209   2.51587537  -4.09584837 

Actually, I now see the source of the problem:

The data frames dat and new don't contain a matrix named "X"; rather the
matrix is split columnwise:

> names(dat)
[1] "y"   "X.1" "X.2"
> names(new)
[1] "X.1" "X.2"

Consequently, both glm and predict pick up the X in the global environment
(since there is none in dat or new), which accounts for why there are 100
predicted values.

Using list() rather than data.frame() produces the originally expected
behaviour:

> new <- list(X = matrix(rnorm(20),10, 2))
> predict(mod, new)
         1          2          3          4          5          6          7

 5.9373064  0.3687360 -8.3793045  0.7645584 -2.6773842  2.4130547  0.7387318

         8          9         10 
-0.4347916  8.4678728 -0.8976054 

Regards,
 John

> Best,
> Uwe
> 
> 
> 
> 
> 
> 
> 
> >            1            2            3            4            5
> > 6 
> >   1.81224443  -5.92955128   1.98718051 -10.05331521   2.65065555
> > -2.50635812 
> >            7            8            9           10           11
> > 12 
> >   5.63728698  -0.94845276  -3.61657377  -1.63141320   5.03417372
> > 1.80400271 
> >           13           14           15           16           17
> > 18 
> >   9.32876273  -5.32723406   5.29373023  -3.90822713 -10.95065186
> > 4.90038016
> > 
> >  . . .
> > 
> >            97           98           99          100 
> >  -6.92509812   0.59357486  -1.17205723   0.04209578 
> > 
> > 
> > Note that there are 100 rather than 10 predicted values.
> > 
> > But with individuals predictors (rather than a matrix),
> > 
> > 
> >>x1 <- X[,1]
> >>x2 <- X[,2]
> >>dat.2 <- data.frame(y=y, x1=x1, x2=x2)
> >>mod.2 <- glm(y ~ x1 + x2, family=binomial, data=dat.2)
> >>new.2 <- data.frame(x1=rnorm(10), x2=rnorm(10)) 
> predict(mod.2, new.2)
> > 
> >          1          2          3          4          5      
>     6          7
> > 
> >  6.5723823  0.6356392  4.0291018 -4.7914650  2.1435485 -3.1738096 
> > -2.8261585
> > 
> >          8          9         10 
> > -1.5255329 -4.7087592  4.0619290
> > 
> > works as expected (?).
> > 
> > Regards,
> >  John
> >  
> > 
> > 
> >>-----Original Message-----
> >>From: r-help-bounces at stat.math.ethz.ch 
> >>[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Uwe Ligges
> >>Sent: Thursday, September 23, 2004 1:33 AM
> >>To: jrausch at nd.edu
> >>Cc: r-help at stat.math.ethz.ch
> >>Subject: Re: [R] Issue with predict() for glm models
> >>
> >>jrausch at nd.edu wrote:
> >>
> >>
> >>>Hello everyone,
> >>>
> >>>I am having a problem using the predict (or the
> >>
> >>predict.glm) function in R.
> >>
> >>>Basically, I run the glm model on a "training" data set and try to 
> >>>obtain predictions for a set of new predictors from a
> >>
> >>"test" data set
> >>
> >>>(i.e., not the predictors that were utilized to obtain the
> >>
> >>glm parameter estimates).
> >>
> >>>Unfortunately, every time that I attempt this, I obtain the 
> >>>predictions for the predictors that were used to fit the
> >>
> >>glm model. I
> >>
> >>>have looked at the R mailing list archives and don't believe I am 
> >>>making the same mistakes that have been made in the past
> >>
> >>and also have
> >>
> >>>tried to closely follow the predict.glm example in the help
> >>
> >>file. Here is an example of what I am trying to do:
> >>
> >>>########################################################
> >>>set.seed(545345)
> >>>
> >>>################
> >>># Necessary Variables #
> >>>################
> >>>
> >>>p <- 2
> >>>train.n <- 20
> >>>test.n <- 25
> >>>mean.vec.1 <- c(1,1)
> >>>mean.vec.2 <- c(0,0)
> >>>
> >>>Sigma.1 <- matrix(c(1,.5,.5,1),p,p)
> >>>Sigma.2 <- matrix(c(1,.5,.5,1),p,p)
> >>>
> >>>###############
> >>># Load MASS Library #
> >>>###############
> >>>
> >>>library(MASS)
> >>>
> >>>###################################
> >>># Data to Parameters for Logistic Regression Model # 
> >>>###################################
> >>>
> >>>train.data.1 <- mvrnorm(train.n,mu=mean.vec.1,Sigma=Sigma.1)
> >>>train.data.2 <- mvrnorm(train.n,mu=mean.vec.2,Sigma=Sigma.2)
> >>>train.class.var <- as.factor(c(rep(1,train.n),rep(2,train.n)))
> >>>predictors.train <- rbind(train.data.1,train.data.2)
> >>>
> >>>##############################################
> >>># Test Data Where Predictions for Probabilities Using
> >>
> >>Logistic Reg.  #
> >>
> >>># From Training Data are of Interest                        
> >>
> >>                  #
> >>
> >>>##############################################
> >>>
> >>>test.data.1 <- mvrnorm(test.n,mu=mean.vec.1,Sigma=Sigma.1)
> >>>test.data.2 <- mvrnorm(test.n,mu=mean.vec.2,Sigma=Sigma.2)
> >>>predictors.test <- rbind(test.data.1,test.data.2)
> >>>
> >>>##############################
> >>># Run Logistic Regression on Training Data # 
> >>>##############################
> >>>
> >>>log.reg <- glm(train.class.var~predictors.train,
> >>>family=binomial(link="logit"))
> >>
> >>Well, you haven't specified the "data" argument, but given the two 
> >>variables directly. Exactly those variables will be used in the
> >>predict() step below! If you want the predict() step to work, use 
> >>something like:
> >>
> >>   train <- data.frame(class = train.class.var,
> >>                       predictors = predictors.train)
> >>   log.reg <- glm(class ~ ., data = train,
> >>                  family=binomial(link="logit"))
> >>
> >>
> >>
> >>
> >>>log.reg
> >>>
> >>>#> log.reg
> >>>
> >>>#Call:  glm(formula = train.class.var ~ predictors.train, family = 
> >>>#binomial(link = "logit")) #
> >>>#Coefficients:
> >>>#      (Intercept)  predictors.train1  predictors.train2  
> >>>#           0.5105            -0.2945            -1.0811  
> >>>#
> >>>#Degrees of Freedom: 39 Total (i.e. Null);  37 Residual
> >>>#Null Deviance:      55.45 
> >>>#Residual Deviance: 41.67        AIC: 47.67 
> >>>
> >>>###########################
> >>># Predicted Probabilities for Test Data #
> >>
> >>###########################
> >>
> >>>New.Data <- data.frame(predictors.train1=predictors.test[,1],
> >>>predictors.train2=predictors.test[,2])
> >>>
> >>>logreg.pred.prob.test <-
> >>
> >>predict.glm(log.reg,New.Data,type="response")
> >>
> >>>logreg.pred.prob.test
> >>
> >>Instead, use:
> >>
> >>   test <- data.frame(predictors = predictors.test)
> >>   predict(log.reg, newdata = test, type="response")
> >>
> >>
> >>note also: please call the generic predict() rather than its glm 
> >>method.
> >>
> >>
> >>Uwe Ligges
> >>
> >>
> >>
> >>>#logreg.pred.prob.test
> >>># [1] 0.51106406 0.15597423 0.04948404 0.03863875 0.35587589
> >>>0.71331091 # [7] 0.17320087 0.14176632 0.30966718 0.61878952
> >>>0.12525988 0.21271139 #[13] 0.70068113 0.18340723 0.10295501
> >>>0.44591568 0.72285161 0.31499339 #[19] 0.65789420 0.42750139
> >>>0.14435889 0.93008117 0.70798465 0.80109005 #[25] 0.89161472
> >>>0.47480625 0.56520952 0.63981834 0.57595189 0.60075882 #[31]
> >>>0.96493393 0.77015507 0.87643986 0.62973986 0.63043351 0.45398955 
> >>>#[37] 0.80855782 0.90835588 0.54809117 0.11568637 
> >>>########################################################
> >>>
> >>>Of course, notice that the vector for the predicted
> >>
> >>probabilities has
> >>
> >>>only 40 elements, while the "New.Data" has 50 elements
> >>
> >>(since n.test
> >>
> >>>has 25 per group for 2 groups) and thus should have 50 predicted 
> >>>probabilities. As it turns out, the output is for the 
> training data 
> >>>predictors and not for the "New.Data" as I would like it to be. I 
> >>>should also note that I have made sure that the names for the 
> >>>predictors in the "New.Data" are the same as the names for the 
> >>>predictors within the glm object (i.e., within "log.reg")
> >>
> >>as this is what is done in the example for predict.glm() within the 
> >>help files.
> >>
> >>>Could some one help me understand either what I am doing
> >>
> >>incorrectly
> >>
> >>>or what problems there might be within the predict() function? I 
> >>>should mention that I tried the same program using
> >>
> >>predict.glm() and
> >>
> >>>obtained the same problematic results.
> >>>
> >>>Thanks and take care,
> >>>
> >>>Joe
> >>>
> >>>
> >>>Joe Rausch, M.A. 
> >>>Psychology Liaison
> >>>Lab for Social Research
> >>>917 Flanner Hall
> >>>University of Notre Dame
> >>>Notre Dame, IN 46556
> >>>(574) 631-3910
> >>>
> >>>"If we knew what it was we were doing, it would not be called 
> >>>research, would it?"
> >>>- Albert Einstein
> >>>
> >>>______________________________________________
> >>>R-help at stat.math.ethz.ch mailing list 
> >>>https://stat.ethz.ch/mailman/listinfo/r-help
> >>>PLEASE do read the posting guide! 
> >>>http://www.R-project.org/posting-guide.html
> >>
> >>______________________________________________
> >>R-help at stat.math.ethz.ch mailing list
> >>https://stat.ethz.ch/mailman/listinfo/r-help
> >>PLEASE do read the posting guide! 
> >>http://www.R-project.org/posting-guide.html
>