[R] lm fails on some large input

Fox, John j|ox @end|ng |rom mcm@@ter@c@
Thu Apr 18 19:06:38 CEST 2019


Dear Peter,

> -----Original Message-----
> From: peter dalgaard [mailto:pdalgd using gmail.com]
> Sent: Thursday, April 18, 2019 12:23 PM
> To: Fox, John <jfox using mcmaster.ca>
> Cc: Michael Dewey <lists using dewey.myzen.co.uk>; Dingyuan Wang
> <gumblex using aosc.io>; r-help using r-project.org
> Subject: Re: [R] lm fails on some large input
> 
> Um, you need to reverse y and x there. The question was about lm(y ~ x)....
> 

Good catch! I missed that in the original posting, and lm() does indeed produce the LS solution for the regression of y on x. And, as I'd have expected, the naïve approach also fails for the regression of x on y:

> Y <- cbind(1, y)
> b <- solve(t(Y) %*% Y) %*% t(Y) %*% x
Error in solve.default(t(Y) %*% Y) : 
  system is computationally singular: reciprocal condition number = 6.19587e-35

resolving the mystery.

Thanks,
 John

> > X <- cbind(1, y)
> > solve(crossprod(X))
> Error in solve.default(crossprod(X)) :
>   system is computationally singular: reciprocal condition number = 6.19587e-
> 35
> 
> Actually, lm can QR perfectly OK, but it gets caught by its singularity detection:
> 
> > qr <- qr(X, tol=1e-10)
> > qr # without the tol bit, you get same thing but $rank == 1
> $qr
>                              y
>  [1,] -3.0000000 -4.520117e+09
>  [2,]  0.3333333 -3.426530e+01
>  [3,]  0.3333333 -2.947103e-02
>  [4,]  0.3333333  4.252164e-01
>  [5,]  0.3333333 -3.665468e-01
>  [6,]  0.3333333 -3.488029e-01
>  [7,]  0.3333333  2.614064e-01
>  [8,]  0.3333333  4.086982e-01
>  [9,]  0.3333333  2.018556e-03
> 
> $rank
> [1] 2
> 
> $qraux
> [1] 1.333333 1.571779
> 
> $pivot
> [1] 1 2
> 
> attr(,"class")
> [1] "qr"
> > x = c(79.744, 123.904, 87.29601, 116.352, 67.71201, 72.96001, 101.632,
> > 108.928, 94.08)
> > qr.coef(qr,x)
>                           y
> -2.403345e+09  1.595099e+00
> 
> > lm(x~y)
> 
> Call:
> lm(formula = x ~ y)
> 
> Coefficients:
> (Intercept)            y
>       94.73           NA
> 
> > lm(x~y, tol=1e-10)
> 
> Call:
> lm(formula = x ~ y, tol = 1e-10)
> 
> Coefficients:
> (Intercept)            y
>  -2.403e+09    1.595e+00
> 
> > lm(x~I(y-mean(y)))
> 
> Call:
> lm(formula = x ~ I(y - mean(y)))
> 
> Coefficients:
>    (Intercept)  I(y - mean(y))
>         94.734           1.595
> 
> 
> > On 18 Apr 2019, at 17:56 , Fox, John <jfox using mcmaster.ca> wrote:
> >
> > Dear Michael and Dingyuan Wang,
> >
> >> -----Original Message-----
> >> From: R-help [mailto:r-help-bounces using r-project.org] On Behalf Of
> >> Michael Dewey
> >> Sent: Thursday, April 18, 2019 11:25 AM
> >> To: Dingyuan Wang <gumblex using aosc.io>; r-help using r-project.org
> >> Subject: Re: [R] lm fails on some large input
> >>
> >> Perhaps subtract 1506705766 from y?
> >>
> >> Saying some other software does it well implies you know what the
> >> _correct_ answer is here but I would question what that means with
> >> this sort of data- set.
> >
> > It's rather an interesting problem, though, because the naïve computation of
> the LS solution works:
> >
> > plot(x, y)
> > X <- cbind(1, x)
> > b <- solve(t(X) %*% X) %*% t(X) %*% y
> > b
> > abline(b)
> >
> > That surprised me, because I expected that lm() computation, using the QR
> decomposition, would be more numerically stable.
> >
> > Best,
> > John
> >
> > -----------------------------------------------------------------
> > John Fox
> > Professor Emeritus
> > McMaster University
> > Hamilton, Ontario, Canada
> > Web: https://socialsciences.mcmaster.ca/jfox/
> >
> >
> >
> >>
> >> On 17/04/2019 07:26, Dingyuan Wang wrote:
> >>> Hi,
> >>>
> >>> This input doesn't have any interesting properties except y is unix
> >>> time. Spreadsheets can do this well.
> >>> Is this a bug that lm can't do x ~ y?
> >>>
> >>> R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
> >>> Copyright (C) 2018 The R Foundation for Statistical Computing
> >>> Platform: x86_64-pc-linux-gnu (64-bit)
> >>>
> >>>> x = c(79.744, 123.904, 87.29601, 116.352, 67.71201, 72.96001,
> >>> 101.632, 108.928, 94.08)  > y = c(1506705739.385, 1506705766.895,
> >>> 1506705746.293, 1506705761.873, 1506705734.743, 1506705735.351,
> >>> 1506705756.26, 1506705761.307,
> >>> 1506705747.372)
> >>>> m = lm(x ~ y)
> >>>> summary(m)
> >>>
> >>> Call:
> >>> lm(formula = x ~ y)
> >>>
> >>> Residuals:
> >>>      Min       1Q   Median       3Q      Max
> >>> -27.0222 -14.9902  -0.6542  14.1938  29.1698
> >>>
> >>> Coefficients: (1 not defined because of singularities)
> >>>             Estimate Std. Error t value Pr(>|t|)
> >>> (Intercept)   94.734      6.511   14.55 4.88e-07 *** y
> >>> NA         NA      NA       NA
> >>> ---
> >>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> >>>
> >>> Residual standard error: 19.53 on 8 degrees of freedom
> >>>
> >>>> summary(lm(y ~ x))
> >>>
> >>> Call:
> >>> lm(formula = y ~ x)
> >>>
> >>> Residuals:
> >>>     Min      1Q  Median      3Q     Max
> >>> -2.1687 -1.3345 -0.9466  1.3826  2.6551
> >>>
> >>> Coefficients:
> >>>              Estimate Std. Error   t value Pr(>|t|)
> >>> (Intercept) 1.507e+09  3.294e+00 4.574e+08  < 2e-16 *** x
> >>> 6.136e-01  3.413e-02 1.798e+01 4.07e-07 ***
> >>> ---
> >>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> >>>
> >>> Residual standard error: 1.885 on 7 degrees of freedom Multiple
> >>> R-squared:  0.9788,    Adjusted R-squared:  0.9758
> >>> F-statistic: 323.3 on 1 and 7 DF,  p-value: 4.068e-07
> >>>
> >>> ______________________________________________
> >>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>> ---
> >>> This email has been checked for viruses by AVG.
> >>> https://www.avg.com
> >>>
> >>>
> >>
> >> --
> >> Michael
> >> http://www.dewey.myzen.co.uk/home.html
> >>
> >> ______________________________________________
> >> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-
> >> guide.html and provide commented, minimal, self-contained,
> >> reproducible code.
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> 
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000
> Frederiksberg, Denmark
> Phone: (+45)38153501
> Office: A 4.23
> Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com
> 
> 
> 
> 
> 
> 
> 
> 



More information about the R-help mailing list