[R] polr in MASS

array chip arrayprofile at yahoo.com
Tue May 6 18:27:57 CEST 2003


Hi,

It seems my last post had formatting problem, so
hopefully this time it works:

I am trying to use the proportional-odds model
(cumulative logistic regression) using the "polr"
function in the MASS library of R. 

I tried with the dataset of "housing" contained in the
MASS book where "Sat" (3 levels: low, medium, high) is
the dependent variable, "Infl" (low, medium, high),
"Type" (tower, apartment, atrium, terrace) and "Cont"
(low, high) are the predictor variables. And I have
some questions; hope someone could help me out. The
following commands are taken from the MASS book as
well. If you know R, it might be better to understand
my problem, if not, I hope it's still ok - my problem
is actually a statistically problem, not a programming
problem, ignore the codes, but just look at the
output.

For the dataset, the proportional odds model tried to
run cumulative logistic regression of the dependent
variable "Sat" on the 3 predictor variables "Infl",
"Type", and "Cont" without interactions. In R, the
lowest level of each variable is treated as the
reference level. 

> house.plr<-polr(Sat~Infl+Type+Cont,data=housing,
weights=Freq)
> summary(house.plr)

Call: polr(formula = Sat ~ Infl + Type + Cont, data =
housing, weights = Freq)

Coefficients:
                   Value Std. Error   t value
InflMedium     0.5663922 0.10465276  5.412109
InflHigh       1.2888137 0.12715609 10.135682
TypeApartment -0.5723552 0.11923800 -4.800107
TypeAtrium    -0.3661907 0.15517331 -2.359882
TypeTerrace   -1.0910073 0.15148595 -7.202036
ContHigh       0.3602803 0.09553577  3.771156 

Intercepts:
            Value   Std. Error t value
Low|Medium  -0.4961  0.1248    -3.9740
Medium|High  0.6907  0.1255     5.5049 

Residual Deviance: 3479.149 
AIC: 3495.149 

I also tried to predict the probabilities of the 3
categories of the dependent variable "Sat" for every
combination of the 3 predictor variables using the
predict function in R: 

> hnames<-lapply(housing[,-5],levels)
>
house.pr1<-predict(house.plr,expand.grid(hnames[-1]),type="probs")
> cbind(expand.grid(hnames[-1]),round(house.pr1,2))

     Infl      Type Cont  Low Medium High
1     Low     Tower  Low 0.38   0.29 0.33
2  Medium     Tower  Low 0.26   0.27 0.47
3    High     Tower  Low 0.14   0.21 0.65
4     Low Apartment  Low 0.52   0.26 0.22
5  Medium Apartment  Low 0.38   0.29 0.33
6    High Apartment  Low 0.23   0.26 0.51
7     Low    Atrium  Low 0.47   0.27 0.26
8  Medium    Atrium  Low 0.33   0.29 0.38
9    High    Atrium  Low 0.19   0.25 0.56
10    Low   Terrace  Low 0.64   0.21 0.14
11 Medium   Terrace  Low 0.51   0.26 0.23
12   High   Terrace  Low 0.33   0.29 0.38
13    Low     Tower High 0.30   0.28 0.42
14 Medium     Tower High 0.19   0.25 0.56
15   High     Tower High 0.10   0.17 0.72
16    Low Apartment High 0.43   0.28 0.29
17 Medium Apartment High 0.30   0.28 0.42
18   High Apartment High 0.17   0.23 0.60
19    Low    Atrium High 0.38   0.29 0.33
20 Medium    Atrium High 0.26   0.27 0.47
21   High    Atrium High 0.14   0.21 0.64
22    Low   Terrace High 0.56   0.25 0.19
23 Medium   Terrace High 0.42   0.28 0.30
24   High   Terrace High 0.26   0.27 0.47

What I am confused is that when I tried to reproduce
these predicted probabilities manually using the model
coefficients, I can sometimes get different results: 

According to cumulative logistic regression, the model
is:

  logit(P(Y<=k | x1,x2,x3) = a + b1*x1 + b2*x2 + b3*x3
for k=1,2,3

For example, for low Infl, Type tower, Cont low (all
on reference level),  

logit(P(Sat=low))=P(Sat=low)/(1-P(Sat=low))=-0.4961,
solve for
P(Sat=low)=exp(-0.4961)/(1+exp(-0.4961))=0.38;

and

logit(P(Sat=low, medium)) =
P(Sat=low,medium)/(1-P(Sat=low,medium)) = 0.6907,
solve for P(Sat=low,medium) =
exp(0.6907)/(1+exp(0.6907))=0.67, which is the sum of
0.38 plus 0.29.

The above 2 examples showed that I can reproduce the
predicted probabilities when using the intercept
alone. However, for other combinations of the
predictor variables, I can NOT reproduce the results. 

For example, for medium Infl, Type tower, cont low,  

logit(P(Sat=low))=P(Sat=low)/(1-P(Sat=low))=-0.4961+0.56639=0.07029,
solve for
P(Sat=low)=exp(0.07029)/(1+exp(0.07029))=0.52; but the
probability using "predict" function is 0.26.

and

logit(P(Sat=low, medium)) =
P(Sat=low,medium)/(1-P(Sat=low,medium)) =
0.6907+0.56639=1.25709, solve for P(Sat=low,medium) =
exp(1.25709)/(1+exp(1.25709))=0.78, which certainly is
NOT the sum of 0.26 plus 0.27, and is not the sum of
0.52 and 0.27. 

Did I misunderstand something here? Or the way I used
to reproduce the predicted probabilities is not
correct? 

Thanks very much!




More information about the R-help mailing list