[Rd] lm() gives different results to lm.ridge() and SPSS
Nick Brown
nick.brown at free.fr
Fri May 5 15:40:59 CEST 2017
> I conjecture that something in the vicinity of
> res <- lm(DEPRESSION ~ scale(ZMEAN_PA) + scale(ZDIVERSITY_PA) + scale(ZMEAN_PA * ZDIVERSITY_PA), data=dat)
> summary(res)
> would reproduce the SPSS Beta values.
Yes, that works. Thanks!
----- Original Message -----
From: "peter dalgaard" <pdalgd at gmail.com>
To: "Viechtbauer Wolfgang (SP)" <wolfgang.viechtbauer at maastrichtuniversity.nl>, "Nick Brown" <nick.brown at free.fr>
Cc: r-devel at r-project.org
Sent: Friday, 5 May, 2017 3:33:29 PM
Subject: Re: [Rd] lm() gives different results to lm.ridge() and SPSS
Thanks, I was just getting around to trying this, but got sidetracked by actual work...
Your analysis reproduces the SPSS unscaled estimates. It still remains to figure out how Nick got
> coefficients(lm(ZDEPRESSION ~ ZMEAN_PA * ZDIVERSITY_PA, data=s1))
   (Intercept)       ZMEAN_PA  ZDIVERSITY_PA  ZMEAN_PA:ZDIVERSITY_PA
    0.07342198    -0.39650356    -0.36569488             -0.09435788
which does not match your output. I suspect that ZMEAN_PA and ZDIVERSITY_PA were scaled for this analysis (but the interaction term obviously still is not). I conjecture that something in the vicinity of
res <- lm(DEPRESSION ~ scale(ZMEAN_PA) + scale(ZDIVERSITY_PA) + scale(ZMEAN_PA * ZDIVERSITY_PA), data=dat)
summary(res)
would reproduce the SPSS Beta values.
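
Why that should work: swapping a regressor x for scale(x) is just a linear reparametrization, so the slope on scale(ZMEAN_PA * ZDIVERSITY_PA) is the slope on the raw product multiplied by the product's SD, with the other slopes unchanged (only the intercept shifts from the centering). A minimal sketch with simulated stand-ins for the Z variables, since the real data aren't inline here:

set.seed(101)
n  <- 1000
x1 <- drop(scale(rnorm(n)))                                 # stand-in for ZMEAN_PA
x2 <- drop(scale(0.75 * x1 + sqrt(1 - 0.75^2) * rnorm(n)))  # correlated, like ZDIVERSITY_PA
y  <- x1 + x2 + 0.5 * x1 * x2 + rnorm(n)

b_raw    <- coef(lm(y ~ x1 * x2))["x1:x2"]                # slope on the raw product
b_scaled <- coef(lm(y ~ x1 + x2 + scale(x1 * x2)))[4]     # slope on the standardized product
all.equal(unname(b_scaled), unname(b_raw * sd(x1 * x2)))  # TRUE

The main-effect slopes agree between the two fits; only the product term gets rescaled, which is exactly the pattern in the R-vs-SPSS comparison.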
> On 5 May 2017, at 14:43 , Viechtbauer Wolfgang (SP) <wolfgang.viechtbauer at maastrichtuniversity.nl> wrote:
>
> I had no problems running regression models in SPSS and R that yielded the same results for these data.
>
> The difference you are observing is from fitting different models. In R, you fitted:
>
> res <- lm(DEPRESSION ~ ZMEAN_PA * ZDIVERSITY_PA, data=dat)
> summary(res)
>
> The interaction term is the product of ZMEAN_PA and ZDIVERSITY_PA. That product is not itself a standardized variable, so it is not the same as "ZINTER_PA_C" in the png you showed. ZINTER_PA_C is not a variable in the dataset, but it can be created with:
>
> dat$ZINTER_PA_C <- with(dat, scale(ZMEAN_PA * ZDIVERSITY_PA))
>
> If you want the same results as in SPSS, then you need to fit:
>
> res <- lm(DEPRESSION ~ ZMEAN_PA + ZDIVERSITY_PA + ZINTER_PA_C, data=dat)
> summary(res)
>
> This yields:
>
> Coefficients:
>                Estimate Std. Error t value Pr(>|t|)
> (Intercept)     6.41041    0.01722  372.21   <2e-16 ***
> ZMEAN_PA       -1.62726    0.04200  -38.74   <2e-16 ***
> ZDIVERSITY_PA  -1.50082    0.07447  -20.15   <2e-16 ***
> ZINTER_PA_C    -0.58955    0.05288  -11.15   <2e-16 ***
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Exactly the same as in the png.
>
> Peter already mentioned this as a possible reason for the discrepancy: https://stat.ethz.ch/pipermail/r-devel/2017-May/074191.html ("Is it perhaps the case that x1 and x2 have already been scaled to have standard deviation 1? In that case, x1*x2 won't be.")
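>
> A quick check on the actual data (assuming it is loaded as "dat", as in the calls above) would be:
>
> with(dat, sd(ZMEAN_PA * ZDIVERSITY_PA)) ## generally not 1, even though both factors have SD 1
>
> That SD is exactly the factor by which the slopes on the raw product and on ZINTER_PA_C differ.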
>
> Best,
> Wolfgang
>
> -----Original Message-----
> From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of Nick Brown
> Sent: Friday, May 05, 2017 10:40
> To: peter dalgaard
> Cc: r-devel at r-project.org
> Subject: Re: [Rd] lm() gives different results to lm.ridge() and SPSS
>
> Hi,
>
> Here is (I hope) all the relevant output from R.
>
>> mean(s1$ZDEPRESSION, na.rm=T)
> [1] -1.041546e-16
>> mean(s1$ZDIVERSITY_PA, na.rm=T)
> [1] -9.660583e-16
>> mean(s1$ZMEAN_PA, na.rm=T)
> [1] -5.430282e-15
>> lm.ridge(ZDEPRESSION ~ ZMEAN_PA * ZDIVERSITY_PA, data=s1)$coef
>       ZMEAN_PA  ZDIVERSITY_PA  ZMEAN_PA:ZDIVERSITY_PA
>     -0.3962254     -0.3636026              -0.1425772  ## This is what I thought was the problem originally. :-)
>
>
>> coefficients(lm(ZDEPRESSION ~ ZMEAN_PA * ZDIVERSITY_PA, data=s1))
>    (Intercept)       ZMEAN_PA  ZDIVERSITY_PA  ZMEAN_PA:ZDIVERSITY_PA
>     0.07342198    -0.39650356    -0.36569488             -0.09435788
>> coefficients(lm.ridge(ZDEPRESSION ~ ZMEAN_PA * ZDIVERSITY_PA, data=s1))
>    (Intercept)       ZMEAN_PA  ZDIVERSITY_PA  ZMEAN_PA:ZDIVERSITY_PA
>     0.07342198    -0.39650356    -0.36569488             -0.09435788
>
> The equivalent from SPSS is attached. The unstandardized coefficients in SPSS look nothing like those in R. The standardized coefficients in SPSS match the lm.ridge()$coef numbers very closely indeed, suggesting that the same algorithm may be in use.
>
> I have put the dataset file, which is the untouched original I received from the authors, in this Dropbox folder: https://www.dropbox.com/sh/xsebjy55ius1ysb/AADwYUyV1bl6-iAw7ACuF1_La?dl=0. You can read it into R with this code (one variable needs to be standardized and centered; everything else is already in the file):
>
> s1 <- read.csv("Emodiversity_Study1.csv", stringsAsFactors=FALSE)
> s1$ZDEPRESSION <- scale(s1$DEPRESSION)
> Hey, maybe R is fine and I've stumbled on a bug in SPSS? If so, I'm sure IBM will want to fix it quickly (ha ha ha).
>
> Nick
>
> ----- Original Message -----
>
> From: "peter dalgaard" <pdalgd at gmail.com>
> To: "Nick Brown" <nick.brown at free.fr>
> Cc: "Simon Bonner" <sbonner6 at uwo.ca>, r-devel at r-project.org
> Sent: Friday, 5 May, 2017 10:02:10 AM
> Subject: Re: [Rd] lm() gives different results to lm.ridge() and SPSS
>
> I asked you before, but in case you missed it: Are you looking at the right place in SPSS output?
>
> The UNstandardized coefficients should be comparable to R, i.e. the "B" column, not "Beta".
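>
> If you do want SPSS-style Betas out of an R fit, the usual conversion is Beta = B * sd(x)/sd(y), applied to every column of the design matrix, product terms included; that is precisely where the interaction diverges. A hedged sketch, with placeholder names y, x1, x2, ds:
>
> fit  <- lm(y ~ x1 * x2, data=ds)
> B    <- coef(fit)[-1]                         # unstandardized slopes, SPSS's "B"
> sdx  <- apply(model.matrix(fit)[, -1], 2, sd) # SD of each design column, x1:x2 included
> Beta <- B * sdx / sd(ds$y)                    # SPSS-style standardized "Beta"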
>
> -pd
>
>> On 5 May 2017, at 01:58 , Nick Brown <nick.brown at free.fr> wrote:
>>
>> Hi Simon,
>>
>> Yes, if I use coefficients() I get the same results for lm() and lm.ridge(). So that's consistent, at least.
>>
>> Interestingly, the "wrong" number I get from lm.ridge()$coef agrees with the value from SPSS to 5dp, which is an interesting coincidence if these numbers have no particular external meaning in lm.ridge().
>>
>> Kind regards,
>> Nick
>>
>> ----- Original Message -----
>>
>> From: "Simon Bonner" <sbonner6 at uwo.ca>
>> To: "Nick Brown" <nick.brown at free.fr>, r-devel at r-project.org
>> Sent: Thursday, 4 May, 2017 7:07:33 PM
>> Subject: RE: [Rd] lm() gives different results to lm.ridge() and SPSS
>>
>> Hi Nick,
>>
>> I think that the problem here is your use of $coef to extract the coefficients of the ridge regression. The help for lm.ridge states that coef is a "matrix of coefficients, one row for each value of lambda. Note that these are not on the original scale and are for use by the coef method."
>>
>> I ran a small test with simulated data (code copied below), and indeed the output from lm.ridge differs depending on whether the coefficients are accessed via $coef or via the coefficients() function. The latter produces results that match the output from lm.
>>
>> I hope that helps.
>>
>> Cheers,
>>
>> Simon
>>
>> ## Load packages
>> library(MASS)
>>
>> ## Set seed
>> set.seed(8888)
>>
>> ## Set parameters
>> n <- 100
>> beta <- c(1,0,1)
>> sigma <- .5
>> rho <- .75
>>
>> ## Simulate correlated covariates
>> Sigma <- matrix(c(1,rho,rho,1),ncol=2)
>> X <- mvrnorm(n,c(0,0),Sigma=Sigma)
>>
>> ## Simulate data
>> mu <- beta[1] + X %*% beta[-1]
>> y <- rnorm(n,mu,sigma)
>>
>> ## Fit model with lm()
>> fit1 <- lm(y ~ X)
>>
>> ## Fit model with lm.ridge()
>> fit2 <- lm.ridge(y ~ X)
>>
>> ## Compare coefficients
>> cbind(fit1$coefficients,c(NA,fit2$coef),coefficients(fit2))
>>
>>                     [,1]        [,2]        [,3]
>> (Intercept)   0.99276001          NA  0.99276001
>> X1           -0.03980772 -0.04282391 -0.03980772
>> X2            1.11167179  1.06200476  1.11167179
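>>
>> The back-transformation is visible in the object itself: if I read ?lm.ridge right, $coef is on the internally rescaled predictors and $scales holds the scale factors used, so, as a sketch continuing fit2 from above,
>>
>> fit2$coef / fit2$scales ## should reproduce coefficients(fit2)[-1], the original-scale slopes
>>
>> Also, y is only centred, never scaled, and the $scales SDs appear to use denominator n rather than n-1. That would explain why lm.ridge()$coef coincides so closely with SPSS's Beta in your data: your outcome ZDEPRESSION already has SD 1, so B * sd(x)/sd(y) reduces to B * sd(x).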
>>
>> --
>>
>> Simon Bonner
>> Assistant Professor of Environmetrics/ Director MMASc
>> Department of Statistical and Actuarial Sciences/Department of Biology
>> University of Western Ontario
>>
>> Office: Western Science Centre rm 276
>>
>> Email: sbonner6 at uwo.ca | Telephone: 519-661-2111 x88205 | Fax: 519-661-3813
>> Twitter: @bonnerstatslab | Website: http://simon.bonners.ca/bonner-lab/wpblog/
>>
>>> -----Original Message-----
>>> From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of Nick
>>> Brown
>>> Sent: May 4, 2017 10:29 AM
>>> To: r-devel at r-project.org
>>> Subject: [Rd] lm() gives different results to lm.ridge() and SPSS
>>>
>>> Hallo,
>>>
>>> I hope I am posting to the right place. I was advised to try this list by Ben Bolker
>>> (https://twitter.com/bolkerb/status/859909918446497795). I also posted this
>>> question to StackOverflow
>>> (http://stackoverflow.com/questions/43771269/lm-gives-different-results-from-lm-ridgelambda-0).
>>> I am a relative newcomer to R, but I wrote my first
>>> program in 1975 and have been paid to program in about 15 different
>>> languages, so I have some general background knowledge.
>>>
>>> I have a regression from which I extract the coefficients like this:
>>> lm(y ~ x1 * x2, data=ds)$coef
>>> That gives: x1=0.40, x2=0.37, x1*x2=0.09
>>>
>>> When I do the same regression in SPSS, I get:
>>> beta(x1)=0.40, beta(x2)=0.37, beta(x1*x2)=0.14.
>>> So the main effects are in agreement, but there is quite a difference in the
>>> coefficient for the interaction.
>>>
>>> X1 and X2 are correlated about .75 (yes, yes, I know - this model wasn't my
>>> idea, but it got published), so there is quite possibly something going on with
>>> collinearity. So I thought I'd try lm.ridge() to see if I can get an idea of where
>>> the problems are occurring.
>>>
>>> The starting point is to run lm.ridge() with lambda=0 (i.e., no ridge penalty) and
>>> check we get the same results as with lm():
>>> lm.ridge(y ~ x1 * x2, lambda=0, data=ds)$coef
>>> x1=0.40, x2=0.37, x1*x2=0.14
>>> So lm.ridge() agrees with SPSS, but not with lm(). (Of course, lambda=0 is the
>>> default, so it can be omitted; I can alternate between including or deleting
>>> ".ridge" in the function call, and watch the coefficient for the interaction
>>> change.)
>>>
>>> What seems slightly strange to me here is that I had assumed lm.ridge() just
>>> piggybacks on lm(), so in the specific case where lambda=0 and there is no
>>> "ridging" to do, I'd expect exactly the same results.
>>>
>>> Unfortunately there are 34,000 cases in the dataset, so a "minimal" reprex will
>>> not be easy to make, but I can share the data via Dropbox or something if that
>>> would help.
>>>
>>> I appreciate that when there is strong collinearity then all bets are off in terms
>>> of what the betas mean, but I would really expect lm() and lm.ridge() to give
>>> the same results. (I would be happy to ignore SPSS, but for the moment it's
>>> part of the majority!)
>>>
>>> Thanks for reading,
>>> Nick
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com