[R] bigglm() results different from glm()
Thomas Lumley
tlumley at u.washington.edu
Wed Mar 18 08:30:30 CET 2009
There's a digits= argument to the print method.
> a <- bigglm(ff,data=trees, chunksize=10, sandwich=TRUE)
> print(summary(a),digits=5)
Large data regression model: bigglm(ff, data = trees, chunksize = 10, sandwich = TRUE)
Sample size = 31
Coef (95% CI) SE p
(Intercept) -6.63162 -8.08729 -5.17595 0.72783 0
log(Girth) 1.98265 1.87132 2.09398 0.05567 0
log(Height) 1.11712 0.73339 1.50085 0.19186 0
Sandwich (model-robust) standard errors
I suppose I should make it take its default from options(digits)-3 or something.
-thomas
On Tue, 17 Mar 2009, Francisco J. Zagmutt wrote:
> Dear Thomas and John,
>
> Thanks for your prompt reply and for looking at your code to explain these
> differences. I see you replied very late at night, so I am sorry if my little
> problem kept you awake!
>
> As you pointed out, the model indeed converges (as reported in model$converged)
> once I specify a larger number of iterations.
>
> A very minor comment: it seems that the reporting of the estimates in
> summary.biglm() is not taking the parameters from options(digits). For
> example, using the same data and models as before:
>
>> require(biglm)
>> options(digits=8)
>> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)), ttment=gl(2,50000))
>> m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
>> m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"), maxit=20)
>
>> summary(m1)
> <Snipped>
> Coefficients:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) 2.3019509 0.0014147 1627.21 < 2.2e-16 ***
> ttment2 0.4052002 0.0018264 221.86 < 2.2e-16 ***
>
>> summary(m1big)
> Coef (95% CI) SE p
> (Intercept) 2.302 2.299 2.305 0.001 0
> ttment2 0.405 0.402 0.409 0.002 0
>
>
> To get more digits I can extract the point estimates using coef(m1big), but
> after looking at str(m1big) the only way I could figure to extract the p-values
> was:
>
>> summary(m1big)$mat[,"p"]
> (Intercept) ttment2
> 0 0
>
> Is there a way I can get summary.biglm() to report more digits directly?
>
> Thanks again,
>
> Francisco
>
>
>
> --
> Francisco J. Zagmutt
> Vose Consulting
> 2891 20th Street
> Boulder, CO, 80304
> USA
> www.voseconsulting.com
>
>
> Thomas Lumley wrote:
>>
>> Yes, the slow convergence is easier to get with the log link. Overshooting
>> the correct coefficient vector has more dramatic effects on the fitted
>> values and weights, and in this example the starting value of (0,0) is a
>> long way from the truth. The same sort of thing happens in the Cox model,
>> where there are real data sets that will cause numeric overflow in a
>> careless implementation.
>>
>> It might be worth trying to guess better starting values: saving an
>> iteration or two would be useful with large data sets.
>>
>> -thomas
>>
>>
>> On Tue, 17 Mar 2009, John Fox wrote:
>>
>>> Dear Francisco,
>>>
>>> I was able to duplicate the problem that you reported, and in addition
>>> discovered that the problem seems to be peculiar to the poisson family.
>>> lm(y~ttment, data=dat) and biglm(y~ttment, data=dat) produce identical
>>> results, as do glm(y~ttment, data=dat) and bigglm(y~ttment, data=dat).
>>> Another example, with the binomial family:
>>>
>>>> set.seed(12345)
>>>> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)),
>>> ttment=gl(2,50000))
>>>> dat$y0 <- ifelse(dat$y > 12, 1, 0)
>>>> m1b <- glm(y0~ttment, data=dat, family=binomial)
>>>> m1bigb <- bigglm(y0~ttment , data=dat, family=binomial())
>>>> coef(m1b)
>>> (Intercept) ttment2
>>> -1.33508 2.34263
>>>> coef(m1bigb)
>>> (Intercept) ttment2
>>> -1.33508 2.34263
>>>> deviance(m1b)
>>> [1] 109244
>>>> deviance(m1bigb)
>>> [1] 109244
>>>
>>> I'm copying this message to Thomas Lumley, who's the author and maintainer
>>> of the biglm package.
>>>
>>> Regards,
>>> John
>>>
>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>>> On
>>>> Behalf Of Francisco J. Zagmutt
>>>> Sent: March-16-09 10:26 PM
>>>> To: r-help at stat.math.ethz.ch
>>>> Subject: [R] bigglm() results different from glm()
>>>>
>>>> Dear all,
>>>>
>>>> I am using the bigglm package to fit a few GLM's to a large dataset (3
>>>> million rows, 6 columns). While trying to fit a Poisson GLM I noticed
>>>> that the coefficient estimates were very different from what I obtained
>>>> when estimating the model on a smaller dataset using glm(), I wrote a
>>>> very basic toy example to compare the results of bigglm() against a
>>>> glm() call. Consider the following code:
>>>>
>>>>
>>>> > require(biglm)
>>>> > options(digits=6, scipen=3, contrasts = c("contr.treatment",
>>>> "contr.poly"))
>>>> > dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)),
>>>> ttment=gl(2,50000))
>>>> > m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
>>>> > m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"))
>>>> > summary(m1)
>>>>
>>>> <snipped output for this email>
>>>> Coefficients:
>>>> Estimate Std. Error z value Pr(>|z|)
>>>> (Intercept) 2.30305 0.00141 1629 <2e-16 ***
>>>> ttment2 0.40429 0.00183 221 <2e-16 ***
>>>>
>>>> Null deviance: 151889 on 99999 degrees of freedom
>>>> Residual deviance: 101848 on 99998 degrees of freedom
>>>> AIC: 533152
>>>>
>>>> > summary(m1big)
>>>> Large data regression model: bigglm(y ~ ttment, data = dat, family =
>>>> poisson(link = "log"))
>>>> Sample size = 100000
>>>> Coef (95% CI) SE p
>>>> (Intercept) 2.651 2.650 2.653 0.001 0
>>>> ttment2 4.346 4.344 4.348 0.001 0
>>>>
>>>> > m1big$deviance
>>>> [1] 287158986
>>>>
>>>>
>>>> Notice that the coefficients and deviance are quite different in the
>>>> model estimated using bigglm(). If I change the chunk to
>>>> seq(1000,10000,1000) the estimates remain the same.
>>>>
>>>> Can someone help me understand what is causing these differences?
>>>>
>>>> Here is my version info:
>>>>
>>>> > version
>>>> _
>>>> platform i386-pc-mingw32
>>>> arch i386
>>>> os mingw32
>>>> system i386, mingw32
>>>> status
>>>> major 2
>>>> minor 8.1
>>>> year 2008
>>>> month 12
>>>> day 22
>>>> svn rev 47281
>>>> language R
>>>> version.string R version 2.8.1 (2008-12-22)
>>>>
>>>>
>>>> Many thanks in advance for your help,
>>>>
>>>> Francisco
>>>>
>>>> -- Francisco J. Zagmutt
>>>> Vose Consulting
>>>> 2891 20th Street
>>>> Boulder, CO, 80304
>>>> USA
>>>> www.voseconsulting.com
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> Thomas Lumley Assoc. Professor, Biostatistics
>> tlumley at u.washington.edu University of Washington, Seattle
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle
More information about the R-help
mailing list