[R] Issue with predict() for glm models

Thu Sep 23 15:10:45 CEST 2004

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of John Fox
> Sent: Thursday, September 23, 2004 8:49 AM
> To: 'Uwe Ligges'
> Cc: r-help at stat.math.ethz.ch
> Subject: RE: [R] Issue with predict() for glm models
> 
> 
> Dear Uwe,
> 
> Unless I've somehow messed this up, as I mentioned yesterday, what you
> suggest doesn't seem to work when the predictor is a matrix. Here's a
> simplified example:
> 
> > X <- matrix(rnorm(200), 100, 2)
> > y <- (X %*% c(1,2) + rnorm(100)) > 0
> > dat <- data.frame(y=y, X=X)
> > mod <- glm(y ~ X, family=binomial, data=dat)
> > new <- data.frame(X = matrix(rnorm(20),2))
> > predict(mod, new)
>            1            2            3            4            5
> 6 
>   1.81224443  -5.92955128   1.98718051 -10.05331521   2.65065555
> -2.50635812 
>            7            8            9           10           11
> 12 
>   5.63728698  -0.94845276  -3.61657377  -1.63141320   5.03417372
> 1.80400271 
>           13           14           15           16           17
> 18 
>   9.32876273  -5.32723406   5.29373023  -3.90822713 -10.95065186
> 4.90038016 
> 
>  . . .
> 
>            97           98           99          100 
>  -6.92509812   0.59357486  -1.17205723   0.04209578 
> 
> 
> Note that there are 100 rather than 10 predicted values.
> 
> But with individuals predictors (rather than a matrix),
> 
> > x1 <- X[,1]
> > x2 <- X[,2]
> > dat.2 <- data.frame(y=y, x1=x1, x2=x2)
> > mod.2 <- glm(y ~ x1 + x2, family=binomial, data=dat.2)
> > new.2 <- data.frame(x1=rnorm(10), x2=rnorm(10))
> > predict(mod.2, new.2)
>          1          2          3          4          5        
>   6          7
> 
>  6.5723823  0.6356392  4.0291018 -4.7914650  2.1435485 
> -3.1738096 -2.8261585
> 
>          8          9         10 
> -1.5255329 -4.7087592  4.0619290 
> 
> works as expected (?).
> 
> Regards,
>  John
>  

My apologies: I have not kept up with the earlier messages in this thread,
so this might be covered already:

> X <- matrix(rnorm(200), 100, 2)
> y <- (X %*% c(1,2) + rnorm(100)) > 0
> dat <- data.frame(y=y, X=X)
> mod <- glm(y ~ X, family=binomial, data=dat)
> newmat <- matrix(rnorm(20), ncol=2)
> new <- data.frame(X = newmat)
> new2 <- data.frame(X = I(newmat))
> str(predict(mod, new))
 Named num [1:100] 3.30 8.08 4.65 4.72 2.15 ...
 - attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
> str(predict(mod, new2))
 Named num [1:10]  7.358  0.241 -5.418  2.094 -0.885 ...
 - attr(*, "names")= chr [1:10] "1" "2" "3" "4" ...

The bottom line is that the data frame passed to predict() must have
variables that matches exactly with what's used in the original formula.
Passing `new' doesn't work as expected because it's a data frame with two
variables, rather than one variable whose name is `X' and is a two-column
matrix.

Best,
Andy

> > From: Uwe Ligges
> > 
> > jrausch at nd.edu wrote:
> > 
> > > Hello everyone,
> > > 
> > > I am having a problem using the predict (or the 
> > predict.glm) function in R.
> > > Basically, I run the glm model on a "training" data set 
> and try to 
> > > obtain predictions for a set of new predictors from a 
> > "test" data set 
> > > (i.e., not the predictors that were utilized to obtain the 
> > glm parameter estimates).
> > > Unfortunately, every time that I attempt this, I obtain the 
> > > predictions for the predictors that were used to fit the 
> > glm model. I 
> > > have looked at the R mailing list archives and don't believe I am 
> > > making the same mistakes that have been made in the past 
> > and also have 
> > > tried to closely follow the predict.glm example in the help 
> > file. Here is an example of what I am trying to do:
> > > 
> > > ########################################################
> > > set.seed(545345)
> > > 
> > > ################
> > > # Necessary Variables #
> > > ################
> > > 
> > > p <- 2
> > > train.n <- 20
> > > test.n <- 25
> > > mean.vec.1 <- c(1,1)
> > > mean.vec.2 <- c(0,0)
> > > 
> > > Sigma.1 <- matrix(c(1,.5,.5,1),p,p)
> > > Sigma.2 <- matrix(c(1,.5,.5,1),p,p)
> > > 
> > > ###############
> > > # Load MASS Library #
> > > ###############
> > > 
> > > library(MASS)
> > > 
> > > ###################################
> > > # Data to Parameters for Logistic Regression Model # 
> > > ###################################
> > > 
> > > train.data.1 <- mvrnorm(train.n,mu=mean.vec.1,Sigma=Sigma.1)
> > > train.data.2 <- mvrnorm(train.n,mu=mean.vec.2,Sigma=Sigma.2)
> > > train.class.var <- as.factor(c(rep(1,train.n),rep(2,train.n)))
> > > predictors.train <- rbind(train.data.1,train.data.2)
> > > 
> > > ##############################################
> > > # Test Data Where Predictions for Probabilities Using 
> > Logistic Reg.  #
> > > # From Training Data are of Interest                        
> >                   #
> > > ##############################################
> > > 
> > > test.data.1 <- mvrnorm(test.n,mu=mean.vec.1,Sigma=Sigma.1)
> > > test.data.2 <- mvrnorm(test.n,mu=mean.vec.2,Sigma=Sigma.2)
> > > predictors.test <- rbind(test.data.1,test.data.2)
> > > 
> > > ##############################
> > > # Run Logistic Regression on Training Data # 
> > > ##############################
> > > 
> > > log.reg <- glm(train.class.var~predictors.train,
> > > family=binomial(link="logit"))
> > 
> > Well, you haven't specified the "data" argument, but given 
> > the two variables directly. Exactly those variables will be 
> > used in the
> > predict() step below! If you want the predict() step to work, 
> > use something like:
> > 
> >    train <- data.frame(class = train.class.var,
> >                        predictors = predictors.train)
> >    log.reg <- glm(class ~ ., data = train,
> >                   family=binomial(link="logit"))
> > 
> > 
> > 
> > > log.reg
> > > 
> > > #> log.reg
> > > 
> > > #Call:  glm(formula = train.class.var ~ predictors.train, 
> family = 
> > > #binomial(link = "logit")) #
> > > #Coefficients:
> > > #      (Intercept)  predictors.train1  predictors.train2  
> > > #           0.5105            -0.2945            -1.0811  
> > > #
> > > #Degrees of Freedom: 39 Total (i.e. Null);  37 Residual
> > > #Null Deviance:      55.45 
> > > #Residual Deviance: 41.67        AIC: 47.67 
> > > 
> > > ###########################
> > > # Predicted Probabilities for Test Data # 
> > ###########################
> > > 
> > > New.Data <- data.frame(predictors.train1=predictors.test[,1],
> > > predictors.train2=predictors.test[,2])
> > > 
> > > logreg.pred.prob.test <- 
> > predict.glm(log.reg,New.Data,type="response")
> > > logreg.pred.prob.test
> > 
> > Instead, use:
> > 
> >    test <- data.frame(predictors = predictors.test)
> >    predict(log.reg, newdata = test, type="response")
> > 
> > 
> > note also: please call the generic predict() rather than its 
> > glm method.
> > 
> > 
> > Uwe Ligges
> > 
> > 
> > > #logreg.pred.prob.test
> > > # [1] 0.51106406 0.15597423 0.04948404 0.03863875 0.35587589 
> > > 0.71331091 # [7] 0.17320087 0.14176632 0.30966718 0.61878952 
> > > 0.12525988 0.21271139 #[13] 0.70068113 0.18340723 0.10295501 
> > > 0.44591568 0.72285161 0.31499339 #[19] 0.65789420 0.42750139 
> > > 0.14435889 0.93008117 0.70798465 0.80109005 #[25] 0.89161472 
> > > 0.47480625 0.56520952 0.63981834 0.57595189 0.60075882 #[31] 
> > > 0.96493393 0.77015507 0.87643986 0.62973986 0.63043351 0.45398955 
> > > #[37] 0.80855782 0.90835588 0.54809117 0.11568637 
> > > ########################################################
> > > 
> > > Of course, notice that the vector for the predicted 
> > probabilities has 
> > > only 40 elements, while the "New.Data" has 50 elements 
> > (since n.test 
> > > has 25 per group for 2 groups) and thus should have 50 predicted 
> > > probabilities. As it turns out, the output is for the 
> training data 
> > > predictors and not for the "New.Data" as I would like it to be. I 
> > > should also note that I have made sure that the names for the 
> > > predictors in the "New.Data" are the same as the names for the 
> > > predictors within the glm object (i.e., within "log.reg") 
> > as this is what is done in the example for predict.glm() 
> > within the help files.
> > > 
> > > Could some one help me understand either what I am doing 
> > incorrectly 
> > > or what problems there might be within the predict() function? I 
> > > should mention that I tried the same program using 
> > predict.glm() and 
> > > obtained the same problematic results.
> > > 
> > > Thanks and take care,
> > > 
> > > Joe
> > > 
> > > 
> > > Joe Rausch, M.A. 
> > > Psychology Liaison
> > > Lab for Social Research
> > > 917 Flanner Hall
> > > University of Notre Dame
> > > Notre Dame, IN 46556
> > > (574) 631-3910
> > > 
> > > "If we knew what it was we were doing, it would not be called 
> > > research, would it?"
> > > - Albert Einstein
> > > 
> > > ______________________________________________
> > > R-help at stat.math.ethz.ch mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide! 
> > > http://www.R-project.org/posting-guide.html
> > 
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide! 
> > http://www.R-project.org/posting-guide.html
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html
> 
>