[Rd] "simulate" does not include variability in parameter estimation

Fri Dec 27 05:14:03 CET 2019

Hello, All:

       The default "simulate" method for lm and glm seems to ignore the 
sampling variance of the parameter estimates;  see the trivial lm and 
glm examples below.  Both these examples estimate a mean with formula = 
x~1.  In both cases, the variance of the estimated mean is 1.

             * In the lm example with x0 = c(-1, 1), var(x0) = 2, and 
var(unlist(simulate(lm(x0~1), 10000, 1))) is 2.0064.  Shouldn't it be 3 
= var(mean(x0)) + var(x0) = (2/2) + 2?

             * In the glm example with x1=1, 
var(unlist(simulate(glm(x1~1, poisson), 10000, 1))) = 1.006. Shouldn't 
it be 2 = var(glm estimate of the mean) + var(simulated Poisson 
distribution) = 1 + 1?

       I'm asking, because I've recently written "simulate" methods for 
objects of class stats::glm and BMA::bic.glm, where my primary interest 
was simulating the predicted mean with "newdata".  I'm doing this, so I 
can get Monte Carlo prediction intervals.  My current code for 
"simulate.glm" and "simulate.bic.glm" are available in the development 
version of the "Ecfun" package on GitHub:

https://github.com/sbgraves237/Ecfun

       Comparing my new code with "stats:::simulate.lm" raises the 
following questions in my mind regarding "simulate" of a fit object:

             1.  Shouldn't "simulate" start by simulating the random 
variability in the estimated parameters?  I need that for my current 
application.  If a generic "simulate" function should NOT include this, 
what should we call something that does include this?  And how does the 
current stats:::simulate.lm behavior fit with this?

             2.  Shouldn't "simulate" of a model fit include an option 
for "newdata"?  I need that for my application.

             3.  By comparing with "predict.glm", I felt I needed an 
argument 'type = c("link", "response")'.  "predict.glm" has an argument 
'type = c("link", "response", "terms")'.  I didn't need "terms", so I 
didn't take the time to code that.  However, a general "simulate" 
function should probably include that.

             4.  My application involves assumed Poisson counts.  I need 
to simulate those as well.  If I combined those with "simulate.glm", 
what would I call them?  I can't use the word "response", because that's 
already used with a different meaning. Might "observations" be the 
appropriate term?

       What do you think?
       Thanks,
       Spencer Graves

 > x0 <- c(-1, 1)
 > var(x0)
[1] 2
 > fit0 <- lm(x0~1)
 > vcov(fit0)
             (Intercept)
(Intercept)           1
 > sim0 <- simulate(fit0, 10000, 1)
 > var(unlist(sim0))
[1] 2.006408
 > x1 <- 1
 > fit1 <- glm(x1~1, poisson)
 > coef(fit1)
  (Intercept)
4.676016e-11
 > exp(coef(fit1))
(Intercept)
           1
 > vcov(fit1)
             (Intercept)
(Intercept)   0.9999903
 > sim1 <- simulate(fit1, 10000, 1)
 > var(unlist(sim1))
[1] 1.00617
 > sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.2

Matrix products: default
BLAS: 
/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: 
/Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods base

loaded via a namespace (and not attached):
[1] compiler_3.6.2 tools_3.6.2