[R] Dealing with proportions
David Winsemius
dwinsemius at comcast.net
Thu Oct 6 16:33:11 CEST 2011
On Oct 5, 2011, at 4:08 PM, Sam wrote:
> Dear list,
>
> I have very little experience in dealing with proportions, i am sure
> this is a very simple question
Sometimes making that claim in a group of statisticians can provoke an
extended discussion to which there will be no eventual agreement.
Discussions about testing proportions in groups with small numbers is
one such excellent example.
> but i could find no suitable answer beyond doing a chi-sq test and
> then using the Marascuilo procedure as a post-hoc analysis.
>
> I am simply wanting to know if the proportions ( i.e the number of
> Yes / No) significantly differ between the cases and if so which
> cases are significantly high or low?
>
>
> proportion <- structure(list(Case = structure(1:11, .Label = c("A",
> "B", "C",
> "D", "E", "F", "G", "H", "I", "J", "K"), class = "factor"), Yes =
> c(18L,
> 2L, 1L, 2L, 44L, 27L, 2L, 15L, 13L, 3L, 34L), No = c(171L, 11L,
> 5L, 8L, 146L, 80L, 5L, 30L, 22L, 5L, 42L), Num = c(189L, 13L,
> 6L, 10L, 190L, 107L, 7L, 45L, 35L, 8L, 76L)), .Names = c("Case",
> "Yes", "No", "Num"), class = "data.frame", row.names = c(NA,
> -11L))
>
I would probably have turned to logistic regression and used p.adjust
to deal with the multiple comparisons issues. By default, the
Intercept p-value estimate for a simple model with a factor variable
as the predictor tests for the Intercept = 0 tests for of a
proportion to 0.50 (since that is the proportion at which the
log(odds) = 0). It then tests for equality to that odds for all the
other levels (the same as odds ratios being 1 or their
coefficients=0). (The first way I set this up I ended up with your A-
level being the reference and the hypotheses being tested didn't seem
reasonable or "interesting.... namely that the lowest proportion was
equal to 50% and that all the rest of the proportions were equal to
that lowest proportion.)
as.matrix(proportion$Yes/proportion$Num)
[,1]
[1,] 0.0952381
[2,] 0.1538462
[3,] 0.1666667
[4,] 0.2000000
[5,] 0.2315789
[6,] 0.2523364
[7,] 0.2857143
[8,] 0.3333333
[9,] 0.3714286
[10,] 0.3750000
[11,] 0.4473684
So I took your data and made your "K" level the reference value since
its value (0.44) is the one closest to 50% and then looked at the p-
values from the others as information about their similarity to the K
level.
proportion$Case <- relevel(proportion$Case, "K")
casefit <- glm(cbind(Yes, Num) ~ Case, data= proportion,
family="binomial")
summary(casefit)
summary(casefit)$coefficients
p.adjust( summary(casefit)$coefficients[ , 4] )
That set of hypotheses gives very few "significant" results, but most
of your groups are pretty sparse so that should be expected. Another
(and I think better) approach resulting in even fewer "significant"
results is to use logistic regression with an offset term equal to the
log(odds) for the entire group:
log( with(proportion, sum(Yes)/sum(No) ) ) # returns [1] -1.181994
This corresponds to an expected proportion of 0.2346939
casefit <- glm(cbind(Yes, Num) ~ 0+ Case, data= proportion,
offset=rep( log( with(proportion, sum(Yes)/
sum(No) ) ), 11), # each level gets the same offset.
family="binomial")
I used the 0+ on the RHS of the formula to force a labeled estimate
for each level, since the p-values are now in reference to the overall
proportion. Since only one level has a "significant" p-value and it is
highly significant (due to the combination of higher numbers of
"events" and "distance" from the NULL estimate) you will not see much
change if you adjust its p-value.
cbind( round( summary(casefit)$coefficients, 3),
Prpns =round( (proportion$Yes/proportion$Num)[c(11,1:10)], 3),
N_Yes=proportion$Yes[c(11,1:10)] )
Estimate Std. Error z value Pr(>|z|) Prpns N_Yes
CaseK 0.378 0.206 1.830 0.067 0.447 34
CaseA -1.169 0.247 -4.741 0.000 0.095 18
CaseB -0.690 0.760 -0.908 0.364 0.154 2
CaseC -0.610 1.080 -0.565 0.572 0.167 1
CaseD -0.427 0.775 -0.552 0.581 0.200 2
CaseE -0.281 0.167 -1.679 0.093 0.232 44
CaseF -0.195 0.215 -0.905 0.365 0.252 27
CaseG -0.071 0.802 -0.088 0.930 0.286 2
CaseH 0.083 0.298 0.280 0.780 0.333 15
CaseI 0.192 0.325 0.590 0.555 0.371 13
CaseJ 0.201 0.677 0.297 0.766 0.375 3
Comparing the numbers and the observed proportions to the overall
proportion (0.2346939) shows not many (only one actually) groups with
strong evidence of "being different." I think you need larger numbers.
--
David Winsemius, MD
West Hartford, CT
More information about the R-help
mailing list