followup: Re: [R] Issue with predict() for glm models

Thu Sep 23 19:02:57 CEST 2004

I have a follow up question that fits with this thread.

Can you force an overlaid plot showing predicted values to follow the 
scaling of the axes of the plot over which it is laid?

Here is an example based on linear regression, just for clarity.  I have 
followed the procedure described below to create predictions and now 
want to plot the predicted values "on top" of a small section of the x-y 
scatterplot.

x <- rnorm(100, 10, 10)
e <- rnorm(100, 0, 5)
y <- 5 + 10 *x + e

myReg1 <- lm (y~x)
plot(x,y)
newX <- seq(1,10,1)
myPred <- predict(myReg1,data.frame(x=newX))

Now, if I do this, I get 2 graphs "overlaid" but their axes do not "line 
up".

par(new=T)
plot(newX,myPred$fit)

The problem is that the second one uses the "whole width" of the graph 
space, when I'd rather just have it go from the small subset where its x 
is defined, from 1 to 10.  Its stretching the range (1,10) for newX to 
use the same scale that goes from (-15, 35) where it plots x

I know abline() can do this for lm, but for some other kinds of models, 
no  lines() method is provided, and so I am doing this the old fashioned 
way.

pj

John Fox wrote:
> Dear Uwe, 
> 
> 
>>-----Original Message-----
>>From: Uwe Ligges [mailto:ligges at statistik.uni-dortmund.de] 
>>Sent: Thursday, September 23, 2004 8:06 AM
>>To: John Fox
>>Cc: jrausch at nd.edu; r-help at stat.math.ethz.ch
>>Subject: Re: [R] Issue with predict() for glm models
>>
>>John Fox wrote:
>>
>>
>>>Dear Uwe,
>>>
>>>Unless I've somehow messed this up, as I mentioned 
>>
>>yesterday, what you 
>>
>>>suggest doesn't seem to work when the predictor is a 
>>
>>matrix. Here's a 
>>
>>>simplified example:
>>>
>>>
>>>
>>>>X <- matrix(rnorm(200), 100, 2)
>>>>y <- (X %*% c(1,2) + rnorm(100)) > 0
>>>>dat <- data.frame(y=y, X=X)
>>>>mod <- glm(y ~ X, family=binomial, data=dat) new <- data.frame(X = 
>>>>matrix(rnorm(20),2)) predict(mod, new)
>>
>>Dear John,
>>
>>the questioner had a 2 column matrix with 40 and one with 50 
>>observations (not a 100 column matrix with 2 observation) and 
>>for those matrices it works ...
>>
> 
> 
> Indeed, and in my example the matrix predictor X has 2 columns and 100 rows;
> I did screw up the matrix for the "new" data to be used for predictions (in
> the example I sent today but not yesterday), but even when this is done
> right -- where the new data has 10 rows and 2 columns -- there are 100 (not
> 10) predicted values:
> 
> 
>>X <- matrix(rnorm(200), 100, 2)  # original predictor matrix with 100 rows
>>y <- (X %*% c(1,2) + rnorm(100)) > 0
>>dat <- data.frame(y=y, X=X)
>>mod <- glm(y ~ X, family=binomial, data=dat)
>>new <- data.frame(X = matrix(rnorm(20),10, 2)) # corrected -- note 10 rows
>>predict(mod, new) # note 100 predicted values
> 
>            1            2            3            4            5
> 6 
>   5.75238091   0.31874587  -3.00515893  -3.77282121  -1.97511221
> 0.54712914 
>            7            8            9           10           11
> 12 
>   1.85091226   4.38465524  -0.41028694  -1.53942869   0.57613555
> -1.82761518 
> 
>  . . .
> 
>           91           92           93           94           95
> 96 
>   0.36210780   1.71358713  -9.63612775  -4.54257576  -5.29740468
> 2.64363405 
>           97           98           99          100 
>  -4.45478627  -2.44973209   2.51587537  -4.09584837 
> 
> Actually, I now see the source of the problem:
> 
> The data frames dat and new don't contain a matrix named "X"; rather the
> matrix is split columnwise:
> 
> 
>>names(dat)
> 
> [1] "y"   "X.1" "X.2"
> 
>>names(new)
> 
> [1] "X.1" "X.2"
> 
> Consequently, both glm and predict pick up the X in the global environment
> (since there is none in dat or new), which accounts for why there are 100
> predicted values.
> 
> Using list() rather than data.frame() produces the originally expected
> behaviour:
> 
> 
>>new <- list(X = matrix(rnorm(20),10, 2))
>>predict(mod, new)
> 
>          1          2          3          4          5          6          7
> 
>  5.9373064  0.3687360 -8.3793045  0.7645584 -2.6773842  2.4130547  0.7387318
> 
>          8          9         10 
> -0.4347916  8.4678728 -0.8976054 
> 
> Regards,
>  John
> 
> 
>>Best,
>>Uwe
>>
>>
>>
>>
>>
>>
>>
>>
>>>           1            2            3            4            5
>>>6 
>>>  1.81224443  -5.92955128   1.98718051 -10.05331521   2.65065555
>>>-2.50635812 
>>>           7            8            9           10           11
>>>12 
>>>  5.63728698  -0.94845276  -3.61657377  -1.63141320   5.03417372
>>>1.80400271 
>>>          13           14           15           16           17
>>>18 
>>>  9.32876273  -5.32723406   5.29373023  -3.90822713 -10.95065186
>>>4.90038016
>>>
>>> . . .
>>>
>>>           97           98           99          100 
>>> -6.92509812   0.59357486  -1.17205723   0.04209578 
>>>
>>>
>>>Note that there are 100 rather than 10 predicted values.
>>>
>>>But with individuals predictors (rather than a matrix),
>>>
>>>
>>>
>>>>x1 <- X[,1]
>>>>x2 <- X[,2]
>>>>dat.2 <- data.frame(y=y, x1=x1, x2=x2)
>>>>mod.2 <- glm(y ~ x1 + x2, family=binomial, data=dat.2)
>>>>new.2 <- data.frame(x1=rnorm(10), x2=rnorm(10)) 
>>
>>predict(mod.2, new.2)
>>
>>>         1          2          3          4          5      
>>
>>    6          7
>>
>>> 6.5723823  0.6356392  4.0291018 -4.7914650  2.1435485 -3.1738096 
>>>-2.8261585
>>>
>>>         8          9         10 
>>>-1.5255329 -4.7087592  4.0619290
>>>
>>>works as expected (?).
>>>
>>>Regards,
>>> John
>>> 
>>>
>>>
>>>
>>>>-----Original Message-----
>>>>From: r-help-bounces at stat.math.ethz.ch 
>>>>[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Uwe Ligges
>>>>Sent: Thursday, September 23, 2004 1:33 AM
>>>>To: jrausch at nd.edu
>>>>Cc: r-help at stat.math.ethz.ch
>>>>Subject: Re: [R] Issue with predict() for glm models
>>>>
>>>>jrausch at nd.edu wrote:
>>>>
>>>>
>>>>
>>>>>Hello everyone,
>>>>>
>>>>>I am having a problem using the predict (or the
>>>>
>>>>predict.glm) function in R.
>>>>
>>>>
>>>>>Basically, I run the glm model on a "training" data set and try to 
>>>>>obtain predictions for a set of new predictors from a
>>>>
>>>>"test" data set
>>>>
>>>>
>>>>>(i.e., not the predictors that were utilized to obtain the
>>>>
>>>>glm parameter estimates).
>>>>
>>>>
>>>>>Unfortunately, every time that I attempt this, I obtain the 
>>>>>predictions for the predictors that were used to fit the
>>>>
>>>>glm model. I
>>>>
>>>>
>>>>>have looked at the R mailing list archives and don't believe I am 
>>>>>making the same mistakes that have been made in the past
>>>>
>>>>and also have
>>>>
>>>>
>>>>>tried to closely follow the predict.glm example in the help
>>>>
>>>>file. Here is an example of what I am trying to do:
>>>>
>>>>
>>>>>########################################################
>>>>>set.seed(545345)
>>>>>
>>>>>################
>>>>># Necessary Variables #
>>>>>################
>>>>>
>>>>>p <- 2
>>>>>train.n <- 20
>>>>>test.n <- 25
>>>>>mean.vec.1 <- c(1,1)
>>>>>mean.vec.2 <- c(0,0)
>>>>>
>>>>>Sigma.1 <- matrix(c(1,.5,.5,1),p,p)
>>>>>Sigma.2 <- matrix(c(1,.5,.5,1),p,p)
>>>>>
>>>>>###############
>>>>># Load MASS Library #
>>>>>###############
>>>>>
>>>>>library(MASS)
>>>>>
>>>>>###################################
>>>>># Data to Parameters for Logistic Regression Model # 
>>>>>###################################
>>>>>
>>>>>train.data.1 <- mvrnorm(train.n,mu=mean.vec.1,Sigma=Sigma.1)
>>>>>train.data.2 <- mvrnorm(train.n,mu=mean.vec.2,Sigma=Sigma.2)
>>>>>train.class.var <- as.factor(c(rep(1,train.n),rep(2,train.n)))
>>>>>predictors.train <- rbind(train.data.1,train.data.2)
>>>>>
>>>>>##############################################
>>>>># Test Data Where Predictions for Probabilities Using
>>>>
>>>>Logistic Reg.  #
>>>>
>>>>
>>>>># From Training Data are of Interest                        
>>>>
>>>>                 #
>>>>
>>>>
>>>>>##############################################
>>>>>
>>>>>test.data.1 <- mvrnorm(test.n,mu=mean.vec.1,Sigma=Sigma.1)
>>>>>test.data.2 <- mvrnorm(test.n,mu=mean.vec.2,Sigma=Sigma.2)
>>>>>predictors.test <- rbind(test.data.1,test.data.2)
>>>>>
>>>>>##############################
>>>>># Run Logistic Regression on Training Data # 
>>>>>##############################
>>>>>
>>>>>log.reg <- glm(train.class.var~predictors.train,
>>>>>family=binomial(link="logit"))
>>>>
>>>>Well, you haven't specified the "data" argument, but given the two 
>>>>variables directly. Exactly those variables will be used in the
>>>>predict() step below! If you want the predict() step to work, use 
>>>>something like:
>>>>
>>>>  train <- data.frame(class = train.class.var,
>>>>                      predictors = predictors.train)
>>>>  log.reg <- glm(class ~ ., data = train,
>>>>                 family=binomial(link="logit"))
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>log.reg
>>>>>
>>>>>#> log.reg
>>>>>
>>>>>#Call:  glm(formula = train.class.var ~ predictors.train, family = 
>>>>>#binomial(link = "logit")) #
>>>>>#Coefficients:
>>>>>#      (Intercept)  predictors.train1  predictors.train2  
>>>>>#           0.5105            -0.2945            -1.0811  
>>>>>#
>>>>>#Degrees of Freedom: 39 Total (i.e. Null);  37 Residual
>>>>>#Null Deviance:      55.45 
>>>>>#Residual Deviance: 41.67        AIC: 47.67 
>>>>>
>>>>>###########################
>>>>># Predicted Probabilities for Test Data #
>>>>
>>>>###########################
>>>>
>>>>
>>>>>New.Data <- data.frame(predictors.train1=predictors.test[,1],
>>>>>predictors.train2=predictors.test[,2])
>>>>>
>>>>>logreg.pred.prob.test <-
>>>>
>>>>predict.glm(log.reg,New.Data,type="response")
>>>>
>>>>
>>>>>logreg.pred.prob.test
>>>>
>>>>Instead, use:
>>>>
>>>>  test <- data.frame(predictors = predictors.test)
>>>>  predict(log.reg, newdata = test, type="response")
>>>>
>>>>
>>>>note also: please call the generic predict() rather than its glm 
>>>>method.
>>>>
>>>>
>>>>Uwe Ligges
>>>>
>>>>
>>>>
>>>>
>>>>>#logreg.pred.prob.test
>>>>># [1] 0.51106406 0.15597423 0.04948404 0.03863875 0.35587589
>>>>>0.71331091 # [7] 0.17320087 0.14176632 0.30966718 0.61878952
>>>>>0.12525988 0.21271139 #[13] 0.70068113 0.18340723 0.10295501
>>>>>0.44591568 0.72285161 0.31499339 #[19] 0.65789420 0.42750139
>>>>>0.14435889 0.93008117 0.70798465 0.80109005 #[25] 0.89161472
>>>>>0.47480625 0.56520952 0.63981834 0.57595189 0.60075882 #[31]
>>>>>0.96493393 0.77015507 0.87643986 0.62973986 0.63043351 0.45398955 
>>>>>#[37] 0.80855782 0.90835588 0.54809117 0.11568637 
>>>>>########################################################
>>>>>
>>>>>Of course, notice that the vector for the predicted
>>>>
>>>>probabilities has
>>>>
>>>>
>>>>>only 40 elements, while the "New.Data" has 50 elements
>>>>
>>>>(since n.test
>>>>
>>>>
>>>>>has 25 per group for 2 groups) and thus should have 50 predicted 
>>>>>probabilities. As it turns out, the output is for the 
>>
>>training data 
>>
>>>>>predictors and not for the "New.Data" as I would like it to be. I 
>>>>>should also note that I have made sure that the names for the 
>>>>>predictors in the "New.Data" are the same as the names for the 
>>>>>predictors within the glm object (i.e., within "log.reg")
>>>>
>>>>as this is what is done in the example for predict.glm() within the 
>>>>help files.
>>>>
>>>>
>>>>>Could some one help me understand either what I am doing
>>>>
>>>>incorrectly
>>>>
>>>>
>>>>>or what problems there might be within the predict() function? I 
>>>>>should mention that I tried the same program using
>>>>
>>>>predict.glm() and
>>>>
>>>>
>>>>>obtained the same problematic results.
>>>>>
>>>>>Thanks and take care,
>>>>>
>>>>>Joe
>>>>>
>>>>>
>>>>>Joe Rausch, M.A. 
>>>>>Psychology Liaison
>>>>>Lab for Social Research
>>>>>917 Flanner Hall
>>>>>University of Notre Dame
>>>>>Notre Dame, IN 46556
>>>>>(574) 631-3910
>>>>>
>>>>>"If we knew what it was we were doing, it would not be called 
>>>>>research, would it?"
>>>>>- Albert Einstein
>>>>>
>>>>>______________________________________________
>>>>>R-help at stat.math.ethz.ch mailing list 
>>>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>PLEASE do read the posting guide! 
>>>>>http://www.R-project.org/posting-guide.html
>>>>
>>>>______________________________________________
>>>>R-help at stat.math.ethz.ch mailing list
>>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>>PLEASE do read the posting guide! 
>>>>http://www.R-project.org/posting-guide.html
>>
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

-- 
Paul E. Johnson                       email: pauljohn at ku.edu
Dept. of Political Science            http://lark.cc.ku.edu/~pauljohn
1541 Lilac Lane, Rm 504
University of Kansas                  Office: (785) 864-9086
Lawrence, Kansas 66044-3177           FAX: (785) 864-5700