[R] Dealing with proportions
David Winsemius
dwinsemius at comcast.net
Thu Oct 6 16:49:01 CEST 2011
On Oct 6, 2011, at 10:33 AM, David Winsemius wrote:
>
> On Oct 5, 2011, at 4:08 PM, Sam wrote:
>
>> Dear list,
>>
>> I have very little experience in dealing with proportions, i am
>> sure this is a very simple question
>
> Sometimes making that claim in a group of statisticians can provoke
> an extended discussion on which there will be no eventual agreement.
> Discussions about testing proportions in groups with small numbers
> are one excellent example.
>
>> but i could find no suitable answer beyond doing a chi-sq test and
>> then using the Marascuilo procedure as a post-hoc analysis.
>>
>> I am simply wanting to know if the proportions ( i.e the number of
>> Yes / No) significantly differ between the cases and if so which
>> cases are significantly high or low?
>>
>>
>> proportion <- structure(list(Case = structure(1:11, .Label = c("A",
>> "B", "C",
>> "D", "E", "F", "G", "H", "I", "J", "K"), class = "factor"), Yes =
>> c(18L,
>> 2L, 1L, 2L, 44L, 27L, 2L, 15L, 13L, 3L, 34L), No = c(171L, 11L,
>> 5L, 8L, 146L, 80L, 5L, 30L, 22L, 5L, 42L), Num = c(189L, 13L,
>> 6L, 10L, 190L, 107L, 7L, 45L, 35L, 8L, 76L)), .Names = c("Case",
>> "Yes", "No", "Num"), class = "data.frame", row.names = c(NA,
>> -11L))
>>
>
> I would probably have turned to logistic regression and used
> p.adjust to deal with the multiple-comparisons issue. By default,
> in a simple model with a factor variable as the predictor, the
> Intercept p-value tests whether the Intercept = 0, which amounts to
> testing the reference level's proportion against 0.50 (since that
> is the proportion at which the log(odds) = 0). The remaining
> coefficients then test whether each other level has the same odds
> as the reference level (equivalent to testing that the odds ratios
> are 1, or that the coefficients are 0). (The first way I set this
> up, your "A" level ended up as the reference, and the hypotheses
> being tested did not seem reasonable or "interesting": namely, that
> the lowest proportion was equal to 50% and that all the other
> proportions were equal to that lowest proportion.)
>
> as.matrix(proportion$Yes/proportion$Num)
> [,1]
> [1,] 0.0952381
> [2,] 0.1538462
> [3,] 0.1666667
> [4,] 0.2000000
> [5,] 0.2315789
> [6,] 0.2523364
> [7,] 0.2857143
> [8,] 0.3333333
> [9,] 0.3714286
> [10,] 0.3750000
> [11,] 0.4473684
>
> So I took your data and made your "K" level the reference value
> since its value (0.44) is the one closest to 50% and then looked at
> the p-values from the others as information about their similarity
> to the K level.
>
> proportion$Case <- relevel(proportion$Case, "K")
> casefit <- glm(cbind(Yes, Num) ~ Case, data= proportion,
> family="binomial")
Should have been:

casefit <- glm(cbind(Yes, No) ~ Case, data = proportion,
               family = "binomial")
> summary(casefit)
> summary(casefit)$coefficients
> p.adjust( summary(casefit)$coefficients[ , 4] )
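[For reference, p.adjust() uses the Holm step-down method by default; a
toy example on made-up p-values shows how the adjustment behaves:]

```r
# Holm adjustment: each sorted p-value is multiplied by the number of
# hypotheses not yet rejected, then a running maximum is taken.
p.adjust(c(0.01, 0.02, 0.03))                   # default method is "holm"
p.adjust(c(0.01, 0.02, 0.03), method = "holm")  # identical result
# both return 0.03 0.04 0.04
```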
>
> That set of hypotheses gives very few "significant" results, but
> most of your groups are pretty sparse so that should be expected.
> Another (and I think better) approach resulting in even fewer
> "significant" results is to use logistic regression with an offset
> term equal to the log(odds) for the entire group:
>
> log( with(proportion, sum(Yes)/sum(No) ) ) # returns [1] -1.181994
> This corresponds to an expected proportion of 0.2346939
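[As a sanity check, back-transforming that offset with the inverse logit
recovers the stated overall proportion; the counts 161 and 525 are
sum(Yes) and sum(No) from the data frame posted above:]

```r
# sum(Yes) = 161 and sum(No) = 525 for the posted data.
overall_logodds <- log(161/525)  # -1.181994, the value used as the offset
plogis(overall_logodds)          # 0.2346939, the overall proportion
161/(161 + 525)                  # the same proportion computed directly
```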
>
> casefit <- glm(cbind(Yes, Num) ~ 0+ Case, data= proportion,
> offset=rep( log( with(proportion, sum(Yes)/
> sum(No) ) ), 11), # each level gets the same offset.
> family="binomial")
This should have been:

casefit <- glm(cbind(Yes, No) ~ 0 + Case, data = proportion,
               offset = rep(log(with(proportion, sum(Yes)/sum(No))), 11),
               family = "binomial")
>
> I used the 0+ on the RHS of the formula to force a labeled estimate
> for each level, since the p-values are now in reference to the
> overall proportion. Since only one level has a "significant" p-value
> and it is highly significant (due to the combination of higher
> numbers of "events" and "distance" from the NULL estimate) you will
> not see much change if you adjust its p-value.
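[A toy illustration of that `0 +` trick, on made-up data rather than the
poster's: dropping the intercept yields one labeled coefficient per
factor level, instead of an intercept plus contrasts against the
reference level.]

```r
# Made-up two-level factor and binary response, just to show the
# coefficient naming; the estimates themselves are not of interest.
f <- factor(c("a", "a", "b", "b"))
y <- c(0, 1, 0, 1)
names(coef(glm(y ~ f, family = "binomial")))      # "(Intercept)" "fb"
names(coef(glm(y ~ 0 + f, family = "binomial")))  # "fa" "fb"
```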
>
New version of the table:

> cbind(round(summary(casefit)$coefficients, 3),
+       Prpns = round((proportion$Yes/proportion$Num)[c(11, 1:10)], 3),
+       N_Yes = proportion$Yes[c(11, 1:10)])
Estimate Std. Error z value Pr(>|z|) Prpns N_Yes
CaseK 0.971 0.231 4.208 0.000 0.447 34
CaseA -1.069 0.248 -4.315 0.000 0.095 18
CaseB -0.523 0.769 -0.680 0.496 0.154 2
CaseC -0.427 1.095 -0.390 0.696 0.167 1
CaseD -0.204 0.791 -0.258 0.796 0.200 2
CaseE -0.017 0.172 -0.101 0.919 0.232 44
CaseF 0.096 0.223 0.430 0.667 0.252 27
CaseG 0.266 0.837 0.318 0.751 0.286 2
CaseH 0.489 0.316 1.546 0.122 0.333 15
CaseI 0.656 0.350 1.875 0.061 0.371 13
CaseJ 0.671 0.730 0.919 0.358 0.375 3
>
> Comparing the counts and the observed proportions to the overall
> proportion (0.2346939) shows few groups (only one, in fact) with
> strong evidence of "being different." I think you need larger
> numbers.
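[For comparison with the chi-squared approach mentioned in the original
question, the overall test of homogeneity can be run directly on the
Yes/No counts; this reuses the `proportion` data frame posted above, and
chisq.test() will warn because several expected cell counts are small:]

```r
# Overall chi-squared test that the Yes/No split is the same across
# cases; it answers only the "do they differ at all?" question, not
# which cases are high or low.
yn <- as.matrix(proportion[, c("Yes", "No")])
chisq.test(yn)
```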
>
With the corrected model, there are now two groups with significant differences.
> p.adjust( summary(casefit)$coefficients[,4] )
CaseK CaseA CaseB CaseC CaseD
0.0002580851 0.0001753949 1.0000000000 1.0000000000 1.0000000000
CaseE CaseF CaseG CaseH CaseI
1.0000000000 1.0000000000 1.0000000000 0.9770895678 0.5472099721
CaseJ
1.0000000000
> --
> David Winsemius, MD
> West Hartford, CT
>
David Winsemius, MD
West Hartford, CT