[R] Multiple regression in R - unstandardised coefficients a

Tue Aug 23 15:54:08 CEST 2011

On Tue, Aug 23, 2011 at 7:54 AM, JC Matthews <J.C.Matthews at bristol.ac.uk> wrote:
> Thank you for your replies, you've answered my question and given me more to
> think on.  I guess it is unwise to draw any conclusions from the
> standardised results for these reasons.

No, by all means try to draw conclusions! Isn't that the point of the
analysis in the first place? All I am (we are?) saying is that you
need to do your homework and learn how to draw _appropriate_
conclusions from the analysis.

Best,
Ista

>
> James.
>
> --On 22 August 2011 17:30 +0100 ted.harding at wlandres.net wrote:
>
>> On 22-Aug-11 15:37:40, JC Matthews wrote:
>>>
>>> Hello,
>>>
>>> I have a statistical problem that I am using R for, but I am
>>> not making sense of the results. I am trying to use multiple
>>> regression to explore which variables (weather conditions)
>>> have the greater effect on a local atmospheric variable.
>>> The data is taken from a database that has 20391 data points (Z1).
>>>
>>> A simplified version of the data I'm looking at is given below,
>>> but I have a problem in that there is a disagreement in sign
>>> between the regression coefficients and the standardised regression
>>> coefficients. Intuitively I would expect both to be the same sign,
>>> but in many of the parameters, they are not.
>>>
>>> I am aware that some people strongly discourage the use of
>>> standardised regression coefficients, but I would nevertheless
>>> like to see the results, not least because the discrepancy has
>>> made me doubt the non-standardised values of B that R has given me.
>>>
>>> The code I have used, and some of the data, is as follows (once
>>> the database has been imported from SQL, and outliers removed).
>>>
>>> Z1sub <- Z1[, c(2, 5, 7, 11, 12, 13, 15, 16)]
>>> colnames(Z1sub) <- c("temp", "hum", "wind", "press", "rain", "s.rad",
>>>                      "mean1", "sd1")
>>>
>>> attach(Z1sub)
>>> names(Z1sub)
>>>
>>>
>>> Model1d <- lm(mean1 ~ hum*wind*rain + I(hum^2) + I(wind^2) + I(rain^2))
>>>
>>> summary(Model1d)
>>>
>>> Call:
>>> lm(formula = mean1 ~ hum * wind * rain + I(hum^2) + I(wind^2) +
>>>    I(rain^2))
>>>
>>> Residuals:
>>>     Min       1Q   Median       3Q      Max
>>> -1230.64   -63.17    18.51    97.85  1275.73
>>>
>>> Coefficients:
>>>                Estimate Std. Error t value Pr(>|t|)
>>> (Intercept)   -9.243e+02  5.689e+01 -16.246  < 2e-16 ***
>>> hum            2.835e+01  1.468e+00  19.312  < 2e-16 ***
>>> wind           1.236e+02  4.832e+00  25.587  < 2e-16 ***
>>> rain          -3.144e+03  7.635e+02  -4.118 3.84e-05 ***
>>> I(hum^2)      -1.953e-01  9.393e-03 -20.793  < 2e-16 ***
>>> I(wind^2)      6.914e-01  2.174e-01   3.181  0.00147 **
>>> I(rain^2)      2.730e+02  3.265e+01   8.362  < 2e-16 ***
>>> hum:wind      -1.782e+00  5.448e-02 -32.706  < 2e-16 ***
>>> hum:rain       2.798e+01  8.410e+00   3.327  0.00088 ***
>>> wind:rain      6.018e+02  2.146e+02   2.805  0.00504 **
>>> hum:wind:rain -6.606e+00  2.401e+00  -2.751  0.00594 **
>>> ---
>>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>>
>>> Residual standard error: 180.5 on 20337 degrees of freedom
>>> Multiple R-squared: 0.2394,     Adjusted R-squared: 0.239
>>> F-statistic: 640.2 on 10 and 20337 DF,  p-value: < 2.2e-16
>>>
>>>
>>>
>>>
>>>
>>> To calculate the standardised coefficients, I used the following:
>>>
>>> Z1sub.scaled <- data.frame(scale(Z1sub[, c('temp', 'hum', 'wind', 'press',
>>>                                           'rain', 's.rad', 'mean1', 'sd1')]))
>>>
>>> attach(Z1sub.scaled)
>>> names(Z1sub.scaled)
>>>
>>>
>>> Model1d.sc <- lm(mean1 ~ hum*wind*rain + I(hum^2) + I(wind^2) + I(rain^2))
>>>
>>> summary(Model1d.sc)
>>>
>>> Call:
>>> lm(formula = mean1 ~ hum * wind * rain + I(hum^2) + I(wind^2) +
>>>    I(rain^2))
>>>
>>> Residuals:
>>>     Min       1Q   Median       3Q      Max
>>> -5.94713 -0.30527  0.08946  0.47287  6.16503
>>>
>>> Coefficients:
>>>                Estimate Std. Error t value Pr(>|t|)
>>> (Intercept)    0.0806858  0.0096614   8.351  < 2e-16 ***
>>> hum           -0.4581509  0.0073456 -62.371  < 2e-16 ***
>>> wind          -0.1995316  0.0073767 -27.049  < 2e-16 ***
>>> rain          -0.1806894  0.0158037 -11.433  < 2e-16 ***
>>> I(hum^2)      -0.1120435  0.0053885 -20.793  < 2e-16 ***
>>> I(wind^2)      0.0172870  0.0054346   3.181  0.00147 **
>>> I(rain^2)      0.0040575  0.0004853   8.362  < 2e-16 ***
>>> hum:wind      -0.2188729  0.0066659 -32.835  < 2e-16 ***
>>> hum:rain       0.0267420  0.0146201   1.829  0.06740 .
>>> wind:rain      0.0365615  0.0122335   2.989  0.00281 **
>>> hum:wind:rain -0.0438790  0.0159479  -2.751  0.00594 **
>>> ---
>>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>>
>>> Residual standard error: 0.8723 on 20337 degrees of freedom
>>> Multiple R-squared: 0.2394,     Adjusted R-squared: 0.239
>>> F-statistic: 640.2 on 10 and 20337 DF,  p-value: < 2.2e-16
>>>
>>>
>>>
>>> So having, for instance for humidity (hum), B = 28.35 +/- 1.468, while
>>> Beta = -0.4581509 +/- 0.0073456, is concerning. Is this normal, or is
>>> there an error in my code that has caused this contradiction?
>>>
>>> Many thanks,
>>>
>>> James.
>>> ----------------------
>>> JC Matthews
>>> School of Chemistry
>>> Bristol University
>>
>> Hi,
>> without having your data, so unable to check, I would not be
>> surprised if the changes of sign were the outcome of your model
>> formula, in particular the 3-variable (2nd-order) interaction,
>> i.e. you are using a model which is non-linear in the variables
>> themselves. Let's just take that part of the model:
>>
>>  lm(formula = mean1 ~ hum * wind * rain)
>>
>> This, in its quantitative expression, expands to:
>>
>>  mean1 = C0 + C11*hum + C12*wind + C13*rain
>>             + C21*hum*wind + C22*hum*rain + C23*wind*rain
>>             + C31*hum*wind*rain
>>
>> Suppose that is for the unstandardised variables. Now express
>> it in terms of standardised variables (initial capital letters):
>>
>>  mean1 = C0 + C11*sd(hum)*(Hum + mean(hum)/sd(hum))
>>             + C12*sd(wind)*(Wind + mean(wind)/sd(wind))
>>             + C13*sd(rain)*(Rain + mean(rain)/sd(rain))
>>
>>             + C21*sd(hum)*sd(wind)*
>>                   (Hum + mean(hum)/sd(hum))*(Wind + mean(wind)/sd(wind))
>>
>>             + C22*sd(hum)*sd(rain)*
>>                   (Hum + mean(hum)/sd(hum))*(Rain + mean(rain)/sd(rain))
>>
>>             + C23*sd(wind)*sd(rain)*
>>                   (Wind + mean(wind)/sd(wind))*
>>                   (Rain + mean(rain)/sd(rain))
>>
>>             + C31*sd(hum)*sd(wind)*sd(rain)*
>>                 (Hum + mean(hum)/sd(hum))*
>>                 (Wind + mean(wind)/sd(wind))*
>>                 (Rain + mean(rain)/sd(rain))
>>
>> Now pick out, say, the coefficient of 'Hum' in this latter expression
>> (i.e. all the terms which involve 'Hum' but neither 'Wind' nor 'Rain'):
>>
>>  C11*sd(hum)
>> + C21*sd(hum)*sd(wind)*mean(wind)/sd(wind)
>> + C22*sd(hum)*sd(rain)*mean(rain)/sd(rain)
>> + C31*sd(hum)*sd(wind)*sd(rain)*
>>      (mean(wind)/sd(wind))*(mean(rain)/sd(rain))
>>
>> = C11*sd(hum)
>> + C21*sd(hum)*mean(wind)
>> + C22*sd(hum)*mean(rain)
>> + C31*sd(hum)*mean(wind)*mean(rain)
>>
>> So there is no reason to expect this to have even the same sign
>> as the original C11, the coefficient of 'hum', let alone any more
>> specific relationship with it!
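
The expression just derived for the coefficient of 'Hum' can be checked
numerically. What follows is only a minimal sketch under stated
assumptions: the data are simulated stand-ins (not the poster's Z1
database), only the interaction part of the model is fitted, as in the
reply above, and only the predictors are standardised, so the response
stays on its raw scale. The names d, ds, fit_raw, fit_std, hum_predicted
and hum_fitted are invented for this illustration.

```r
set.seed(42)
n <- 2000

# Simulated stand-ins for the weather variables (not the poster's data).
d <- data.frame(
  hum  = rnorm(n, mean = 80, sd = 10),
  wind = rnorm(n, mean = 5,  sd = 2),
  rain = rnorm(n, mean = 1,  sd = 0.5)
)
# By construction the true slope of hum is 2 - 0.5 * wind:
# +2 at wind = 0, but -0.5 at the simulated mean wind of 5.
d$mean1 <- 10 + 2 * d$hum + 30 * d$wind - 0.5 * d$hum * d$wind +
  5 * d$rain + rnorm(n, sd = 5)

preds <- c("hum", "wind", "rain")

# Fit on the raw predictors, passing data= rather than using attach().
fit_raw <- lm(mean1 ~ hum * wind * rain, data = d)
C <- coef(fit_raw)

# Standardise the predictors only; the response stays on its raw scale.
ds <- d
for (v in preds) ds[[v]] <- as.numeric(scale(d[[v]]))
fit_std <- lm(mean1 ~ hum * wind * rain, data = ds)

# Coefficient of 'Hum' predicted from the raw fit by the expression above.
m <- sapply(d[preds], mean)
s <- sapply(d[preds], sd)
hum_predicted <- unname(
  s["hum"] * (C["hum"] +
              C["hum:wind"] * m["wind"] +
              C["hum:rain"] * m["rain"] +
              C["hum:wind:rain"] * m["wind"] * m["rain"])
)
hum_fitted <- unname(coef(fit_std)["hum"])

# Raw coefficient, the value predicted by the formula, and the value
# actually obtained from the standardised fit.
print(c(raw = unname(C["hum"]), predicted = hum_predicted, fitted = hum_fitted))
print(all.equal(hum_predicted, hum_fitted))
```

In this simulation the raw coefficient of hum describes the slope at
wind = 0 and rain = 0, whereas the standardised one describes the slope
at the mean wind and rain, which is why the two can differ in sign.
Standardising the response as well would divide the standardised
coefficient by sd(mean1), which cannot change its sign. A squared term
such as I(hum^2) would, by the same expansion, contribute a further
2*sd(hum)*mean(hum) times its coefficient.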
>>
>> Hoping this helps,
>> Ted.
>>
>>
>>
>> --------------------------------------------------------------------
>> E-Mail: (Ted Harding) <ted.harding at wlandres.net>
>> Fax-to-email: +44 (0)870 094 0861
>> Date: 22-Aug-11                                       Time: 17:30:29
>> ------------------------------ XFMail ------------------------------
>
>
>
> ----------------------
> JC Matthews
> Atmospheric Chemistry Research Group
> School of Chemistry
> Bristol University
> J.C.Matthews at bristol.ac.uk
>

-- 
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org