[Rd] "simulate" does not include variability in parameter estimation

Fri Dec 27 17:46:34 CET 2019

On 27/12/2019 10:31 a.m., Spencer Graves wrote:
> 
> 
> On 2019-12-27 04:34, Duncan Murdoch wrote:
>> On 26/12/2019 11:14 p.m., Spencer Graves wrote:
>>> Hello, All:
>>>
>>>
>>>          The default "simulate" method for lm and glm seems to ignore the
>>> sampling variance of the parameter estimates;  see the trivial lm and
>>> glm examples below.  Both these examples estimate a mean with formula =
>>> x~1.  In both cases, the variance of the estimated mean is 1.
>>
>> That's how it's documented to operate.  Nothing in the help page
>> suggests it would try to simulate parameter values.  Indeed, it
>> doesn't have enough information on the distribution to sample from:
>> the appropriate distribution to simulate from if you want to include
>> uncertainty in the parameter estimates is the posterior distribution,
>> but lm and glm take a classical point of view, not a Bayesian point of
>> view, so they have no concept of a posterior.
> 
> 
>         Thanks for the reply.  What do you suggest for someone who wants
> confidence, prediction and tolerance intervals for newdata for a general
> fit object?

I can't suggest anything other than to take a Bayesian approach.

Duncan Murdoch

> 
>         For a glm object, one could get confidence intervals starting
> with predicted mean and standard errors
> from predict(glm(...), newdata, type='link', se.fit=TRUE), then linkinv
> to get the confidence intervals on scale of expected values of the
> random variables.  From that one could compute tolerance intervals.
> 
> 
>         Is there a way to get more standard prediction intervals from a
> glm object, other than the Bayesian approach coded into
> Ecfun:::simulate.glm?  And that still doesn't answer the question re.
> confidence intervals for a more general fit object like BMA::bic.glm.
> 
> 
>         Comments?
>         Thanks,
>         Spencer Graves
> 
>>
>>>                * In the lm example with x0 = c(-1, 1), var(x0) = 2, and
>>> var(unlist(simulate(lm(x0~1), 10000, 1))) is 2.0064.  Shouldn't it be 3
>>> = var(mean(x0)) + var(x0) = (2/2) + 2?
>>
>> That calculation ignores the uncertainty in the estimation of sigma.
>>
>> Duncan Murdoch
>>
>>>
>>>
>>>                * In the glm example with x1=1,
>>> var(unlist(simulate(glm(x1~1, poisson), 10000, 1))) = 1.006. Shouldn't
>>> it be 2 = var(glm estimate of the mean) + var(simulated Poisson
>>> distribution) = 1 + 1?
>>>
>>>
>>>          I'm asking, because I've recently written "simulate" methods for
>>> objects of class stats::glm and BMA::bic.glm, where my primary interest
>>> was simulating the predicted mean with "newdata".  I'm doing this, so I
>>> can get Monte Carlo prediction intervals.  My current code for
>>> "simulate.glm" and "simulate.bic.glm" are available in the development
>>> version of the "Ecfun" package on GitHub:
>>>
>>>
>>> https://github.com/sbgraves237/Ecfun
>>>
>>>
>>>          Comparing my new code with "stats:::simulate.lm" raises the
>>> following questions in my mind regarding "simulate" of a fit object:
>>>
>>>
>>>                1.  Shouldn't "simulate" start by simulating the random
>>> variability in the estimated parameters?  I need that for my current
>>> application.  If a generic "simulate" function should NOT include this,
>>> what should we call something that does include this?  And how does the
>>> current stats:::simulate.lm behavior fit with this?
>>>
>>>
>>>                2.  Shouldn't "simulate" of a model fit include an option
>>> for "newdata"?  I need that for my application.
>>>
>>>
>>>                3.  By comparing with "predict.glm", I felt I needed an
>>> argument 'type = c("link", "response")'.  "predict.glm" has an argument
>>> 'type = c("link", "response", "terms")'.  I didn't need "terms", so I
>>> didn't take the time to code that.  However, a general "simulate"
>>> function should probably include that.
>>>
>>>
>>>                4.  My application involves assumed Poisson counts.  I
>>> need
>>> to simulate those as well.  If I combined those with "simulate.glm",
>>> what would I call them?  I can't use the word "response", because that's
>>> already used with a different meaning. Might "observations" be the
>>> appropriate term?
>>>
>>>
>>>          What do you think?
>>>          Thanks,
>>>          Spencer Graves
>>>
>>>
>>>    > x0 <- c(-1, 1)
>>>    > var(x0)
>>> [1] 2
>>>    > fit0 <- lm(x0~1)
>>>    > vcov(fit0)
>>>                (Intercept)
>>> (Intercept)           1
>>>    > sim0 <- simulate(fit0, 10000, 1)
>>>    > var(unlist(sim0))
>>> [1] 2.006408
>>>    > x1 <- 1
>>>    > fit1 <- glm(x1~1, poisson)
>>>    > coef(fit1)
>>>     (Intercept)
>>> 4.676016e-11
>>>    > exp(coef(fit1))
>>> (Intercept)
>>>              1
>>>    > vcov(fit1)
>>>                (Intercept)
>>> (Intercept)   0.9999903
>>>    > sim1 <- simulate(fit1, 10000, 1)
>>>    > var(unlist(sim1))
>>> [1] 1.00617
>>>    > sessionInfo()
>>> R version 3.6.2 (2019-12-12)
>>> Platform: x86_64-apple-darwin15.6.0 (64-bit)
>>> Running under: macOS Catalina 10.15.2
>>>
>>> Matrix products: default
>>> BLAS:
>>> /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
>>>
>>> LAPACK:
>>> /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
>>>
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods base
>>>
>>> loaded via a namespace (and not attached):
>>> [1] compiler_3.6.2 tools_3.6.2
>>>
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>>
>