[R] (gam) formula: Why different results for terms being factor vs. numeric?

Marius Hofert marius.hofert at math.ethz.ch
Sat Nov 2 22:11:10 CET 2013

Dear Bert,

Thanks for helping.

'group' is a factor. My question was why I don't get the expected
behavior if 'group' is not a factor.

>From a theoretical (non-programming) point of view, there is no
difference in a factor with two levels and a 0-1 (bool/integer)
variable (in my case the 1-2 variable 'group'). gam() interprets
these inputs differently, thus distinguishes these cases (and I
was wondering why; In my opinion, this is a purely R/mgcv related
question and belongs here).

As it turned out, the problem was merely the following: By using
factors and thus specifying a GAM, the intercept was 'hidden' in
the estimated coefficients. When using integers as group
variables, this is a glm and there one needs the intercept. The
examples below provide the details.

With best wishes,

Marius

require(mgcv)
n <- 10
yrs <- 2000+seq_len(n)
loss <- c(seq_len(n)+runif(n), 5+seq_len(n)+runif(n))

## Version 1: gam() with 'group' as factor #####################################

set.seed(271)
dat <- data.frame(year  = rep(yrs, 2),
group = as.factor(rep(1:2, each=n)), # could also be "A", "B"
resp  = loss)
fit1 <- glm(resp ~ year + group - 1, data=dat)
plot(yrs, fit1\$fitted.values[seq_len(n)], type="l", ylim=range(dat\$resp),
xlab="Year", ylab="Response") # fit group A; mean over all
responses in this group
lines (yrs, fit1\$fitted.values[n+seq_len(n)], col="blue") # fit group
B; mean over all responses in this group
points(yrs, dat\$resp[seq_len(n)]) # actual response group A
points(yrs, dat\$resp[n+seq_len(n)], col="blue") # actual response group B

## Version 2: gam() with 'group' as numeric (=> glm) ###########################

set.seed(271)
dat <- data.frame(year  = rep(yrs, 2),
group = rep(1:2, each=n), # could also be 0:1
resp  = loss)
fit2 <- glm(resp ~ year + group - 1, data=dat) # (*)
plot(yrs, fit2\$fitted.values[seq_len(n)], type="l", ylim=range(dat\$resp),
xlab="Year", ylab="Response") # fit group A; mean over all
responses in this group
lines (yrs, fit2\$fitted.values[n+seq_len(n)], col="blue") # fit group
B; mean over all responses in this group
points(yrs, dat\$resp[seq_len(n)]) # actual response group A
points(yrs, dat\$resp[n+seq_len(n)], col="blue") # actual response group B

## Note: without '-1' (intercept) in (*), an unexpected behavior results
## Explanation:
## S. Wiki GAM (without beta_0):
##    g(E(Y)) = f_1(x_1) + f_2(x_2)
## where f_i(x_i) may be functions with a specified parametric form
(for example a
## polynomial, or a coefficient depending on the levels of a factor variable)
## => for f_i's being coefficients (numbers) beta_i, this is a GLM:
##    g(E(Y)) = beta_1 x_1 + beta_2 x_2 (x_1 = year, x_2 = group)
## Problem: (*) does not specify an intercept and thus the lines are
not picked up correctly
fit2\$coefficients

## Version 3: Version 2 with intercept #########################################

set.seed(271)
dat <- data.frame(year  = rep(yrs, 2),
group = rep(1:2, each=n), # could also be 0:1
resp  = loss)
fit3 <- glm(resp ~ year + group, data=dat) # now with intercept
plot(yrs, fit3\$fitted.values[seq_len(n)], type="l", ylim=range(dat\$resp),
xlab="Year", ylab="Response") # fit group A; mean over all
responses in this group
lines (yrs, fit3\$fitted.values[n+seq_len(n)], col="blue") # fit group
B; mean over all responses in this group
points(yrs, dat\$resp[seq_len(n)]) # actual response group A
points(yrs, dat\$resp[n+seq_len(n)], col="blue") # actual response group B
## => correct/as expected
fit3\$coefficients

## Note: in Version 1, the intercept is already included in the group
coefficients:
fit1\$coefficients

On Tue, Oct 29, 2013 at 9:31 PM, Bert Gunter <gunter.berton at gene.com> wrote:
> Think about it. How can one define a smooth term with a factor???
>
> Further discussion is probably offtopic. Post on
> stats.stackexchange.com if it still isn't obvious.
>
> Cheers,
> Bert