[R] bigglm() results different from glm()
Thomas Lumley
tlumley at u.washington.edu
Tue Mar 17 18:22:57 CET 2009
Yes, the slow convergence is easier to get with the log link. Overshooting the correct coefficient vector has more dramatic effects on the fitted values and weights, and in this example the starting value of (0,0) is a long way from the truth. The same sort of thing happens in the Cox model, where there are real data sets that will cause numeric overflow in a careless implementation.
It might be worth trying to guess better starting values: saving an iteration or two would be useful with large data sets.
-thomas
On Tue, 17 Mar 2009, John Fox wrote:
> Dear Francisco,
>
> I was able to duplicate the problem that you reported, and in addition
> discovered that the problem seems to be peculiar to the poisson family.
> lm(y~ttment, data=dat) and biglm(y~ttment, data=dat) produce identical
> results, as do glm(y~ttment, data=dat) and bigglm(y~ttment, data=dat).
> Another example, with the binomial family:
>
>> set.seed(12345)
>> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)),
> ttment=gl(2,50000))
>> dat$y0 <- ifelse(dat$y > 12, 1, 0)
>> m1b <- glm(y0~ttment, data=dat, family=binomial)
>> m1bigb <- bigglm(y0~ttment , data=dat, family=binomial())
>> coef(m1b)
> (Intercept) ttment2
> -1.33508 2.34263
>> coef(m1bigb)
> (Intercept) ttment2
> -1.33508 2.34263
>> deviance(m1b)
> [1] 109244
>> deviance(m1bigb)
> [1] 109244
>
> I'm copying this message to Thomas Lumley, who's the author and maintainer
> of the biglm package.
>
> Regards,
> John
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On
>> Behalf Of Francisco J. Zagmutt
>> Sent: March-16-09 10:26 PM
>> To: r-help at stat.math.ethz.ch
>> Subject: [R] bigglm() results different from glm()
>>
>> Dear all,
>>
>> I am using the bigglm package to fit a few GLM's to a large dataset (3
>> million rows, 6 columns). While trying to fit a Poisson GLM I noticed
>> that the coefficient estimates were very different from what I obtained
>> when estimating the model on a smaller dataset using glm(), I wrote a
>> very basic toy example to compare the results of bigglm() against a
>> glm() call. Consider the following code:
>>
>>
>> > require(biglm)
>> > options(digits=6, scipen=3, contrasts = c("contr.treatment",
>> "contr.poly"))
>> > dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)),
>> ttment=gl(2,50000))
>> > m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
>> > m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"))
>> > summary(m1)
>>
>> <snipped output for this email>
>> Coefficients:
>> Estimate Std. Error z value Pr(>|z|)
>> (Intercept) 2.30305 0.00141 1629 <2e-16 ***
>> ttment2 0.40429 0.00183 221 <2e-16 ***
>>
>> Null deviance: 151889 on 99999 degrees of freedom
>> Residual deviance: 101848 on 99998 degrees of freedom
>> AIC: 533152
>>
>> > summary(m1big)
>> Large data regression model: bigglm(y ~ ttment, data = dat, family =
>> poisson(link = "log"))
>> Sample size = 100000
>> Coef (95% CI) SE p
>> (Intercept) 2.651 2.650 2.653 0.001 0
>> ttment2 4.346 4.344 4.348 0.001 0
>>
>> > m1big$deviance
>> [1] 287158986
>>
>>
>> Notice that the coefficients and deviance are quite different in the
>> model estimated using bigglm(). If I change the chunk to
>> seq(1000,10000,1000) the estimates remain the same.
>>
>> Can someone help me understand what is causing these differences?
>>
>> Here is my version info:
>>
>> > version
>> _
>> platform i386-pc-mingw32
>> arch i386
>> os mingw32
>> system i386, mingw32
>> status
>> major 2
>> minor 8.1
>> year 2008
>> month 12
>> day 22
>> svn rev 47281
>> language R
>> version.string R version 2.8.1 (2008-12-22)
>>
>>
>> Many thanks in advance for your help,
>>
>> Francisco
>>
>> --
>> Francisco J. Zagmutt
>> Vose Consulting
>> 2891 20th Street
>> Boulder, CO, 80304
>> USA
>> www.voseconsulting.com
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle
More information about the R-help
mailing list