[R] ordering of factor levels in regression changes result
Tulinsky, Thomas
TTulinsky at DIRECTV.com
Fri Feb 3 22:16:54 CET 2012
I was surprised to find that just changing the base level of a factor variable changed the number of significant coefficients in the solution.
I was surprised at this and want to know how I should choose the order of the factors, if the order affects the result.
Here is the small example. It is taken from 'The R Book', Crawley p. 365.
The data is at
http://www.bio.ic.ac.uk/research/mjcraw/therbook/data/competition.txt
In R
> comp<-read.table("C:\\Temp\\competition.txt", header=T)
> attach(comp)
Data has dependent variable 'biomass' and different types of 'clipping' that were done:
Control (none), n25, n50, r10, r5:
> summary(comp)
biomass clipping
Min. :415.0 control:6
1st Qu.:508.8 n25 :6
Median :568.0 n50 :6
Mean :561.8 r10 :6
3rd Qu.:631.8 r5 :6
Max. :731.0
List mean Biomass of each type of Clipping:
> aggregate (comp$biomass, list (comp$clipping) , mean)
Group.1 x
control 465.1667
n25 553.3333
n50 569.3333
r10 610.6667
r5 610.5000
do regression - get same result as book p. 365
Clipping type 'control' is not in list of coefficients because it is first alphabetically so it is folded into Intercept
> model<-lm(biomass ~ clipping)
> summary(model)
Call:
lm(formula = biomass ~ clipping)
Residuals:
Min 1Q Median 3Q Max
-103.333 -49.667 3.417 43.375 177.667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 465.17 28.75 16.177 9.4e-15 ***
clippingn25 88.17 40.66 2.168 0.03987 *
clippingn50 104.17 40.66 2.562 0.01683 *
clippingr10 145.50 40.66 3.578 0.00145 **
clippingr5 145.33 40.66 3.574 0.00147 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 70.43 on 25 degrees of freedom
Multiple R-squared: 0.4077, Adjusted R-squared: 0.3129
F-statistic: 4.302 on 4 and 25 DF, p-value: 0.008752
Relevel - make 'n25' the base level of Clipping:
> comp$clipping <- relevel (comp$clipping, ref="n25")
> summary(comp)
biomass clipping
Min. :415.0 n25 :6
1st Qu.:508.8 control:6
Median :568.0 n50 :6
Mean :561.8 r10 :6
3rd Qu.:631.8 r5 :6
Max. :731.0
Redo LM with releveled data
> modelRelev<-lm(biomass~clipping, data=comp)
Different results. (Some parts, Residuals and Std Errors, are the same)
Especially note the Pr and Signifcance columns are different.
> summary(modelRelev)
Call:
lm(formula = biomass ~ clipping, data = comp)
Residuals:
Min 1Q Median 3Q Max
-103.333 -49.667 3.417 43.375 177.667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 553.33 28.75 19.244 <2e-16 ***
clippingcontrol -88.17 40.66 -2.168 0.0399 *
clippingn50 16.00 40.66 0.393 0.6973
clippingr10 57.33 40.66 1.410 0.1709
clippingr5 57.17 40.66 1.406 0.1721
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 70.43 on 25 degrees of freedom
Multiple R-squared: 0.4077, Adjusted R-squared: 0.3129
F-statistic: 4.302 on 4 and 25 DF, p-value: 0.008752
More information about the R-help
mailing list