[R] Issue of reproducibility with gam and lm.wfit in different versions of R
Sebastien Bihorel
sebastien.bihorel at cognigencorp.com
Tue Oct 24 03:57:52 CEST 2017
Dear R users,
I recently stumbled upon problems of reproducibility while running GAM analyses in different R and gam package versions. In the example below, a small dataset is created in which the y and x1 variables are 100% correlated. The intents of this example were primarily for regression testing and, secondarily, to evaluate how the gam algorithm behaves under extreme/limit conditions.
I ran this little snippet in 5 different environments and got 100% consistent results until I switched to R 3.3.2.
* Comparing results from environments 1, 2, and 3 shows that changing the version of the gam package did not change the results under R 3.3.0.
* Comparing results from environments 3 and 4 shows that changing the version of R altered the values of the AIC and the output of the step.gam call (changed to a NULL object)
* Comparing results from environments 4 and 5 shows that reverting to an older version of the gam package in R 3.3.2 still produced altered AIC values and the NULL output from step.gam call
Further investigations into these differences seem to show that the lm.wfit call in the gam.fit function (called from within gam and step.gam) may result in different values in R 3.3.0 and 3.3.2.
So my questions are 2-fold:
1- Would you have any information about why lm.wfit would produce different outcomes in R 3.3.0 and 3.3.2? The source code does not appear significantly different.
2- Is it expected that the step.gam function returns a NULL object in R 3.3.2 while a valid model has been identified? Looking at the source of step.gam, the line 157 (if(is.null(form.list)) break) seems to be the reason the function breaks out and returns a NULL value.
I thank you in advance for your time.
Sebastien Bihorel
library(gam)
dat <- data.frame(
y = c(57,57,98,83,122,69,108,86,80,87,75,76,97,101,121,111,105,84,65,54,61,71,125,60,50,112,102,110,77,45,93,62,120,115,70,113,117,85,46,123,89,95,116,55,110,92,109,100,72,88,105,119,94,45,67,58,60,45,107,73,100,79,47,99,51,53,68,125,90,48,82,85,65,52,70,59,125,49,118,103,91,124,78,81,63,63),
x1 = c(52,52,93,78,117,64,103,81,75,82,70,71,92,96,116,106,100,79,60,49,56,66,120,55,45,107,97,105,72,40,88,57,115,110,65,108,112,80,41,118,84,90,111,50,105,87,104,95,67,83,100,114,89,40,62,53,55,40,102,68,95,74,42,94,46,48,63,120,85,43,77,80,60,47,65,54,120,44,113,98,86,119,73,76,58,58),
x2 = c(0.0001,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0001,0)
)
summary(dat)
scope <- list(
x1=c('1','x1','ns(x1, df=2)'),
x2=c('1','x2')
)
gam.object <- gam(y~1, data=dat)
step.object <- step.gam(gam.object, scope=scope, trace=2)
is.null(step.object)
# Run this if you want to evaluate one lm.wfit example call made from within gam functions
x <- cbind(rep(1,nrow(dat)), dat$x1)
names(x) <- c('Intercept', 'x1')
z <- dat$y
w <- rep(1,nrow(dat))
str(eval(expression(lm.wfit(x, z, w, method = "qr", singular.ok = TRUE))))
######################################################################
#1 R 3.1.2 (x86_64-redhat-linux-gnu (64-bit)) / gam 1.09.1
######################################################################
> step.object <- step.gam(gam.object, scope=scope, trace=2)
Start: y ~ 1; AIC= 797.0034
Trial: y ~ x1 + 1 ; AIC= -5121.796
Trial: y ~ 1 + x2 ; AIC= 796.915
Step:1 y ~ x1 ; AIC= -5121.796
Trial: y ~ ns(x1, df=2) + 1 ; AIC= -5123.816
Trial: y ~ x1 + x2 ; AIC= -5174.408
Step:2 y ~ x1 + x2 ; AIC= -5174.408
Trial: y ~ ns(x1, df=2) + x2 ; AIC= -5164.829
> is.null(step.object)
[1] FALSE
######################################################################
#2: R 3.3.0 (x86_64-redhat-linux-gnu (64-bit)) / gam 1.12
######################################################################
> step.object <- step.gam(gam.object, scope=scope, trace=2)
Start: y ~ 1; AIC= 797.0034
Trial: y ~ x1 + 1 ; AIC= -5121.796
Trial: y ~ 1 + x2 ; AIC= 796.915
Step:1 y ~ x1 ; AIC= -5121.796
Trial: y ~ ns(x1, df=2) + 1 ; AIC= -5123.816
Trial: y ~ x1 + x2 ; AIC= -5174.408
Step:2 y ~ x1 + x2 ; AIC= -5174.408
Trial: y ~ ns(x1, df=2) + x2 ; AIC= -5164.829
> is.null(step.object)
[1] FALSE
######################################################################
#3: R 3.3.0 (x86_64-redhat-linux-gnu (64-bit)) / gam 1.14-4
######################################################################
> step.object <- step.gam(gam.object, scope=scope, trace=2)
Start: y ~ 1; AIC= 797.0034
Trial: y ~ x1 + 1 ; AIC= -5121.796
Trial: y ~ 1 + x2 ; AIC= 796.915
Step:1 y ~ x1 ; AIC= -5121.796
Trial: y ~ ns(x1, df=2) + 1 ; AIC= -5123.816
Trial: y ~ x1 + x2 ; AIC= -5174.408
Step:2 y ~ x1 + x2 ; AIC= -5174.408
Trial: y ~ ns(x1, df=2) + x2 ; AIC= -5164.829
> is.null(step.object)
[1] FALSE
######################################################################
#4: R 3.3.2 (x86_64-redhat-linux-gnu (64-bit)) / gam 1.14-4
######################################################################
> step.object <- step.gam(gam.object, scope=scope, trace=2)
Start: y ~ 1; AIC= 797.0034
Trial: y ~ x1 + 1 ; AIC= -5122.886
Trial: y ~ 1 + x2 ; AIC= 796.915
Step:1 y ~ x1 ; AIC= -5122.886
Trial: y ~ ns(x1, df=2) + 1 ; AIC= -5177.531
Trial: y ~ x1 + x2 ; AIC= -5177.531
Step:2 y ~ ns(x1, df = 2) ; AIC= -5177.531
Trial: y ~ ns(x1, df=2) + x2 ; AIC= -5222.703
Step:3 y ~ ns(x1, df = 2) + x2 ; AIC= -5222.703
> is.null(step.object)
[1] TRUE
######################################################################
#5: R 3.3.2 (x86_64-redhat-linux-gnu (64-bit)) / gam 1.09.1
######################################################################
> step.object <- step.gam(gam.object, scope=scope, trace=2)
Start: y ~ 1; AIC= 797.0034
Trial: y ~ x1 + 1 ; AIC= -5122.886
Trial: y ~ 1 + x2 ; AIC= 796.915
Step:1 y ~ x1 ; AIC= -5122.886
Trial: y ~ ns(x1, df=2) + 1 ; AIC= -5177.531
Trial: y ~ x1 + x2 ; AIC= -5177.531
Step:2 y ~ ns(x1, df = 2) ; AIC= -5177.531
Trial: y ~ ns(x1, df=2) + x2 ; AIC= -5222.703
Step:3 y ~ ns(x1, df = 2) + x2 ; AIC= -5222.703
> is.null(step.object)
[1] TRUE
More information about the R-help
mailing list