[R] (gam) formula: Why different results for terms being factor vs. numeric?
Marius Hofert
marius.hofert at math.ethz.ch
Sat Nov 2 22:11:10 CET 2013
Dear Bert,
Thanks for helping.
Your questions 'answers' why I get the expected behavior if
'group' is a factor. My question was why I don't get the expected
behavior if 'group' is not a factor.
>From a theoretical (non-programming) point of view, there is no
difference in a factor with two levels and a 0-1 (bool/integer)
variable (in my case the 1-2 variable 'group'). gam() interprets
these inputs differently, thus distinguishes these cases (and I
was wondering why; In my opinion, this is a purely R/mgcv related
question and belongs here).
As it turned out, the problem was merely the following: By using
factors and thus specifying a GAM, the intercept was 'hidden' in
the estimated coefficients. When using integers as group
variables, this is a glm and there one needs the intercept. The
examples below provide the details.
With best wishes,
Marius
require(mgcv)
n <- 10
yrs <- 2000+seq_len(n)
loss <- c(seq_len(n)+runif(n), 5+seq_len(n)+runif(n))
## Version 1: gam() with 'group' as factor #####################################
set.seed(271)
dat <- data.frame(year = rep(yrs, 2),
group = as.factor(rep(1:2, each=n)), # could also be "A", "B"
resp = loss)
fit1 <- glm(resp ~ year + group - 1, data=dat)
plot(yrs, fit1$fitted.values[seq_len(n)], type="l", ylim=range(dat$resp),
xlab="Year", ylab="Response") # fit group A; mean over all
responses in this group
lines (yrs, fit1$fitted.values[n+seq_len(n)], col="blue") # fit group
B; mean over all responses in this group
points(yrs, dat$resp[seq_len(n)]) # actual response group A
points(yrs, dat$resp[n+seq_len(n)], col="blue") # actual response group B
## Version 2: gam() with 'group' as numeric (=> glm) ###########################
set.seed(271)
dat <- data.frame(year = rep(yrs, 2),
group = rep(1:2, each=n), # could also be 0:1
resp = loss)
fit2 <- glm(resp ~ year + group - 1, data=dat) # (*)
plot(yrs, fit2$fitted.values[seq_len(n)], type="l", ylim=range(dat$resp),
xlab="Year", ylab="Response") # fit group A; mean over all
responses in this group
lines (yrs, fit2$fitted.values[n+seq_len(n)], col="blue") # fit group
B; mean over all responses in this group
points(yrs, dat$resp[seq_len(n)]) # actual response group A
points(yrs, dat$resp[n+seq_len(n)], col="blue") # actual response group B
## Note: without '-1' (intercept) in (*), an unexpected behavior results
## Explanation:
## S. Wiki GAM (without beta_0):
## g(E(Y)) = f_1(x_1) + f_2(x_2)
## where f_i(x_i) may be functions with a specified parametric form
(for example a
## polynomial, or a coefficient depending on the levels of a factor variable)
## => for f_i's being coefficients (numbers) beta_i, this is a GLM:
## g(E(Y)) = beta_1 x_1 + beta_2 x_2 (x_1 = year, x_2 = group)
## Problem: (*) does not specify an intercept and thus the lines are
not picked up correctly
fit2$coefficients
## Version 3: Version 2 with intercept #########################################
set.seed(271)
dat <- data.frame(year = rep(yrs, 2),
group = rep(1:2, each=n), # could also be 0:1
resp = loss)
fit3 <- glm(resp ~ year + group, data=dat) # now with intercept
plot(yrs, fit3$fitted.values[seq_len(n)], type="l", ylim=range(dat$resp),
xlab="Year", ylab="Response") # fit group A; mean over all
responses in this group
lines (yrs, fit3$fitted.values[n+seq_len(n)], col="blue") # fit group
B; mean over all responses in this group
points(yrs, dat$resp[seq_len(n)]) # actual response group A
points(yrs, dat$resp[n+seq_len(n)], col="blue") # actual response group B
## => correct/as expected
fit3$coefficients
## Note: in Version 1, the intercept is already included in the group
coefficients:
fit1$coefficients
On Tue, Oct 29, 2013 at 9:31 PM, Bert Gunter <gunter.berton at gene.com> wrote:
> Think about it. How can one define a smooth term with a factor???
>
> Further discussion is probably offtopic. Post on
> stats.stackexchange.com if it still isn't obvious.
>
> Cheers,
> Bert
More information about the R-help
mailing list