Handling of offsets in glm is really inconsistent.
Prof Brian D Ripley
ripley@stats.ox.ac.uk
Sat, 22 Aug 1998 18:06:00 +0100 (BST)
[Copied to R-devel for information]
This applies to all versions of R I have: 0.62.2, 0.62.3, 0.63.
Great care seems needed with glms with offsets, as many things seem
wrong.
Consider the following:
> data(freeny)
> freeny.glm <- glm(y ~ offset(lag.quarterly.revenue) + price.index +
income.level + market.potential, data=freeny, subset=1:30)
> predict(freeny.glm)
Qtr1 Qtr2 Qtr3 Qtr4
1962: NA 0.01040457 0.01073223 0.01233351
1963: 0.01211730 0.02744293 0.03259214 0.03225403
1964: 0.03235081 0.03272723 0.03361984 0.03114351
1965: 0.03128098 0.03371820 0.02869783 0.02670215
1966: 0.02346093 0.02172193 0.02269731 0.01927851
1967: 0.02516107 0.02496217 0.02679227 0.03099460
1968: 0.03225807 0.03418791 0.03857815 0.03324247
1969: 0.02575733 0.02702418 0.02988582 NA
> predict(freeny.glm, type="response")
Qtr1 Qtr2 Qtr3 Qtr4
1962: NA 8.806765 8.803092 8.803704
1963: 8.826977 8.840453 8.940102 8.968984
1964: 8.993961 8.993167 9.042300 9.061634
1965: 9.100341 9.092428 9.135678 9.153552
1966: 9.194421 9.208372 9.260927 9.284149
1967: 9.309521 9.338742 9.377042 9.389345
1968: 9.429928 9.455688 9.480808 9.520452
1969: 9.549497 9.566824 9.611116 NA
so one might think that prediction of a glm with an offset should
include the offset on the response scale but not link scale. However,
> predict(freeny.glm, newdata=freeny)
1962.25 1962.5 1962.75 1963 1963.25
0.01040457 0.01073223 0.01233351 0.01211730 0.02744293
1963.5 1963.75 1964 1964.25 1964.5
0.03259214 0.03225403 0.03235081 0.03272723 0.03361984
1964.75 1965 1965.25 1965.5 1965.75
0.03114351 0.03128098 0.03371820 0.02869783 0.02670215
1966 1966.25 1966.5 1966.75 1967
0.02346093 0.02172193 0.02269731 0.01927851 0.02516107
1967.25 1967.5 1967.75 1968 1968.25
0.02496217 0.02679227 0.03099460 0.03225807 0.03418791
1968.5 1968.75 1969 1969.25 1969.5
0.03857815 0.03324247 0.02575733 0.02702418 0.02988582
1969.75 1970 1970.25 1970.5 1970.75
0.02686281 0.03816228 0.03910487 0.04325638 0.03599068
1971 1971.25 1971.5 1971.75
0.03357946 0.03890549 0.04196893 0.03952385
> predict(freeny.glm, newdata=freeny, type="response")
1962.25 1962.5 1962.75 1963 1963.25
0.01040457 0.01073223 0.01233351 0.01211730 0.02744293
1963.5 1963.75 1964 1964.25 1964.5
0.03259214 0.03225403 0.03235081 0.03272723 0.03361984
1964.75 1965 1965.25 1965.5 1965.75
0.03114351 0.03128098 0.03371820 0.02869783 0.02670215
1966 1966.25 1966.5 1966.75 1967
0.02346093 0.02172193 0.02269731 0.01927851 0.02516107
1967.25 1967.5 1967.75 1968 1968.25
0.02496217 0.02679227 0.03099460 0.03225807 0.03418791
1968.5 1968.75 1969 1969.25 1969.5
0.03857815 0.03324247 0.02575733 0.02702418 0.02988582
1969.75 1970 1970.25 1970.5 1970.75
0.02686281 0.03816228 0.03910487 0.04325638 0.03599068
1971 1971.25 1971.5 1971.75
0.03357946 0.03890549 0.04196893 0.03952385
and prediction on either scale with newdata ignores the offset. Now, S is
also inconsistent (prior to S-PLUS 4.5 rel2), but at least it does usually
include the offset (except for prediction on link scale with no newdata,
where the offset was omitted until recently). [The different print layout
is due to the original response being a time series of class "ts"; the
predict method cannot know that.]
The discrepancy is in predict.lm (not predict.glm) which ignores offsets in
R but includes them in S. Correcting it is made difficult by the way
delete.response also deletes offsets in R:
> terms(freeny.glm)
y ~ offset(lag.quarterly.revenue) + price.index + income.level +
market.potential
attr(,"variables")
list(y, offset(lag.quarterly.revenue), price.index, income.level,
market.potential)
attr(,"factors")
price.index income.level market.potential
y 0 0 0
offset(lag.quarterly.revenue) 0 0 0
price.index 1 0 0
income.level 0 1 0
market.potential 0 0 1
attr(,"term.labels")
[1] "price.index" "income.level" "market.potential"
attr(,"order")
[1] 1 1 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,"offset")
[1] 2
> delete.response(terms(freeny.glm))
~price.index + income.level + market.potential
attr(,"variables")
list(price.index, income.level, market.potential)
attr(,"factors")
price.index income.level market.potential
price.index 1 0 0
income.level 0 1 0
market.potential 0 0 1
attr(,"term.labels")
[1] "price.index" "income.level" "market.potential"
attr(,"order")
[1] 1 1 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 0
and that is definitely not what is documented to happen. I do not begin to
understand why delete.response is written the way it is: it looks like it
needs a complete re-design.
So
- delete.response needs to only delete responses.
- predict.lm needs to handle offsets (or predict.glm, but it is
much cleaner in predict.lm).
- glm.fit should return linear.predictors = eta + offset.
There is another small problem:
> predict(freeny.glm, se=T)
Error: Object "price.index" not found
as predict.lm does not recognize that newdata was missing in the caller
(how lazy is lazy evaluation?) The simplest way out is to have
if (missing(newdata) || is.null(newdata)) X <- model.matrix(object)
the alternative is to test missing(newdata) in predict.glm.
While this is being done, I wonder why R does not implement offsets
for lm()? It `looks like an easy exercise'.
--
Brian D. Ripley, ripley@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._