[R] Increasing number of observations worsens the regression model
Fox, John
jfox at mcmaster.ca
Sun May 26 16:06:26 CEST 2019
Dear Raffaele,
Using your code, with one modification -- setting the seed for R's random number generator to make the result reproducible -- I get:
> set.seed(12345)
. . .
> lmMod <- lm(yvar~xvar)
> print(summary(lmMod))
Call:
lm(formula = yvar ~ xvar)
Residuals:
    Min      1Q  Median      3Q     Max 
-4.0293 -0.6732  0.0021  0.6749  4.2883 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.0057713  0.0057529   174.8   <2e-16 ***
xvar        2.0000889  0.0009998  2000.4   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9964 on 29998 degrees of freedom
Multiple R-squared: 0.9926, Adjusted R-squared: 0.9926
F-statistic: 4.002e+06 on 1 and 29998 DF, p-value: < 2.2e-16
which is more or less what one would expect.
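Those numbers are also about what theory predicts: for simple regression with independent errors, the standard error of the slope is roughly sigma/(sd(x)*sqrt(N)). A quick back-of-the-envelope check, assuming x ~ Uniform(-10, 10) and unit error SD as in your simulation:
> sigma <- 1
> sd_x <- 20/sqrt(12)     # sd of Uniform(-10, 10) is (max - min)/sqrt(12)
> N <- 30000
> sigma/(sd_x*sqrt(N))
[1] 0.001
which matches the 0.0009998 reported above.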
My guess: you've saved your R workspace from a previous session, and it is then loaded at the start of your R session; something in the saved workspace is affecting the result, although frankly I can't think what that might be.
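If you want to track down what a restored workspace brought in, a few standard commands may help; a diagnostic sketch, not something specific to your script:
> ls()                        # objects silently restored from .RData
> conflicts(detail = TRUE)    # anything masking base/stats functions?
> find("rnorm")               # should say "package:stats"
Alternatively, start R with the --vanilla flag, which skips loading any saved workspace, and re-run the script in that clean session.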
I hope this helps,
John
-----------------------------------------------------------------
John Fox
Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: https://socialsciences.mcmaster.ca/jfox/
> -----Original Message-----
> From: R-help [mailto:r-help-bounces@r-project.org] On Behalf Of Raffa
> Sent: Saturday, May 25, 2019 8:38 AM
> To: r-help@r-project.org
> Subject: [R] Increasing number of observations worsens the regression model
>
> I have the following code:
>
> ```
>
> rm(list=ls())
> N = 30000
> xvar <- runif(N, -10, 10)
> e <- rnorm(N, mean=0, sd=1)
> yvar <- 1 + 2*xvar + e
> plot(xvar,yvar)
> lmMod <- lm(yvar~xvar)
> print(summary(lmMod))
> domain <- seq(min(xvar), max(xvar))  # define a vector of x values to feed into the model
> lines(domain, predict(lmMod, newdata = data.frame(xvar = domain)))  # add regression line, using `predict` to generate y-values
>
> ```
>
> I expected the coefficients to be close to [1, 2]. Instead R keeps
> returning estimates that are not statistically significant and don't fit
> the model, even though I have 30k observations. For example
>
> ```
>
> Call:
> lm(formula = yvar ~ xvar)
>
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -21.384  -8.908   1.016  10.972  23.663 
>
> Coefficients:
>              Estimate Std. Error t value Pr(>|t|)
> (Intercept) 0.0007145  0.0670316   0.011    0.991
> xvar        0.0168271  0.0116420   1.445    0.148
>
> Residual standard error: 11.61 on 29998 degrees of freedom
> Multiple R-squared:  7.038e-05,  Adjusted R-squared:  3.705e-05
> F-statistic: 2.112 on 1 and 29998 DF,  p-value: 0.1462
>
> ```
>
>
> The strange thing is that the code works perfectly for N=200 or N=2000.
> It's only for larger N (for example, N=20000) that this happens. I have
> tried asking on CrossValidated
> <https://stats.stackexchange.com/questions/410050/increasing-number-of-observations-worsen-the-regression-model>
> but the code works for them. Any help?
>
> I am running R 3.6.0 on Kubuntu 19.04.
>
> Best regards
>
> Raffaele
>
>