[R] Using predict() After Adding a Factor to a glm.nb() Model

Sat Sep 8 23:21:17 CEST 2012

Actually the first one did not work. You left out the warning message:

> py<-data.frame(seq(from=min(data$year), to=max(data$year), by=1))
> p1<-predict(model_a, newdata=py, se.fit=TRUE, type='response')
Warning message:
'newdata' had 20 rows but variable(s) found have 22 rows

Since py did not contain the explanatory variable "year," predict() threw a
warning and then just used the values in data giving you the same results
that you already have stored in model_a. Compare 

> model_a$fitted.values
        1         2         3         4         5         6         7
8 
61.444110 55.163093 49.524141 15.124074 12.190050 61.444110 44.461622
55.163093 
        9        10        11        12        13        14        15
16 
15.124074 12.190050  7.919156 61.444110 55.163093 49.524141 44.461622
39.916610 
       17        18        19        20        21        22 
35.836204 32.172911 28.884091 25.931465 23.280666 20.900841 
> p1$fit
        1         2         3         4         5         6         7
8 
61.444110 55.163093 49.524141 15.124074 12.190050 61.444110 44.461622
55.163093 
        9        10        11        12        13        14        15
16 
15.124074 12.190050  7.919156 61.444110 55.163093 49.524141 44.461622
39.916610 
       17        18        19        20        21        22 
35.836204 32.172911 28.884091 25.931465 23.280666 20.900841

Set the variable name to year in py and then use predict():

> py<-data.frame(year=seq(from=min(data$year), to=max(data$year), by=1))
> p1<-predict(model_a, newdata=py, se.fit=TRUE, type='response')

Now run your plotting commands. The plot may not look much different, but
now it is based on the 20 evenly spaced values in py instead of the 22
original data values in data.

Now you have the right idea for adding site to py but instead of using the
irregularly spaced data, use evenly spaced values. Then you will have to
plot separate lines for each site:

> model_b<-glm.nb(count~year+site, data=data)
> py2 <- expand.grid(site=letters[1:4], year=1980:1999)
> p2<-predict(model_b, newdata=py2, se.fit=TRUE, type='response')
> py2 <- data.frame(py2, p2)
> plot(count~year, data=data, pch=as.character(site))
> x <- unstack(py2, py2$year~py2$site)
> y <- unstack(py2, py2$fit~py2$site)
> matlines(x, y)
> legend("topright", letters[1:4], col=1:4, lty=1:4)

----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> project.org] On Behalf Of 077315q at acadiau.ca
> Sent: Saturday, September 08, 2012 2:47 PM
> To: r-help at r-project.org
> Subject: [R] Using predict() After Adding a Factor to a glm.nb() Model
> 
> # Hello,
> 
> # I have a data set that looks something like the following:
> 
> site<-c(rep('a',5),rep('b',2),rep('c',4),rep('d',11))
> year<-c(1980, 1981, 1982, 1993, 1995, 1980, 1983, 1981, 1993, 1995,
> 1999, c(1980:1990))
> count<-
> c(60,35,36,12,8,112,98,20,13,15,15,65,43,49,51,34,33,33,33,40,11,0)
> 
> data<-data.frame(site, year, count)
> 
> # > site year count
> # 1     a 1980    60
> # 2     a 1981    35
> # 3     a 1982    36
> # 4     a 1993    12
> # .
> # .
> # .
> 
> # I ran a negative binomial glm using:
> 
> library(MASS)
> model_a<-glm.nb(count~year, data=data)
> 
> # then extracted predicted values using:
> 
> py<-data.frame(seq(from=min(data$year), to=max(data$year), by=1))
> p1<-predict(model_a, newdata=py, se.fit=TRUE, type='response')
> f1<-p1$fit
> 
> plot(count~year, data=data)
> lines(py$year, f1)
> 
> # Works perfectly.
> 
> # Now, I want to add site as a factor:
> model_b<-glm.nb(count~year+as.factor(site), data=data)
> 
> # I have tried a couple different ways of predicting the values based
> on model_b, but am having trouble.
> 
> py<-data.frame(seq(from=min(data$year), to=max(data$year), by=1))
> p1<-predict(model_b, newdata=py, se.fit=TRUE, type='response')
> 
> # gives:
> # >Error in model.frame.default(Terms, newdata, na.action = na.action,
> xlev = object$xlevels) :
> #   variable lengths differ (found for 'as.factor(site)')
> # In addition: Warning message:
> # 'newdata' had 20 rows but variable(s) found have 22 rows
> 
> py<-expand.grid(data$site, data$year)
> names(py)<-c('site','year')
> p1<-predict(model_b, newdata=py)
> 
> # This works, but results in 484 values, and I can't plot a line over
> my points.
> 
> # There is probably a simple solution, but I'm having trouble wrapping
> my mind around it. Mind you, this is also a last
> # minute change to my thesis, and I haven't slept in about three days.
> 
> # Any suggestions? I would be extremely grateful...
> 
> # Banging my head against the wall,
> # A stressed out grad student
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.