[R] Dealing with proportions
David Winsemius
dwinsemius at comcast.net
Thu Oct 6 16:49:01 CEST 2011
On Oct 6, 2011, at 10:33 AM, David Winsemius wrote:
>
> On Oct 5, 2011, at 4:08 PM, Sam wrote:
>
>> Dear list,
>>
>> I have very little experience in dealing with proportions, i am
>> sure this is a very simple question
>
> Sometimes making that claim in a group of statisticians can provoke
> an extended discussion on which there will be no eventual agreement.
> Discussions about testing proportions in groups with small numbers
> are one excellent example.
>
>> but i could find no suitable answer beyond doing a chi-sq test and
>> then using the Marascuilo procedure as a post-hoc analysis.
>>
>> I am simply wanting to know if the proportions ( i.e the number of
>> Yes / No) significantly differ between the cases and if so which
>> cases are significantly high or low?
>>
>>
>> proportion <- structure(list(Case = structure(1:11, .Label = c("A",
>> "B", "C",
>> "D", "E", "F", "G", "H", "I", "J", "K"), class = "factor"), Yes =
>> c(18L,
>> 2L, 1L, 2L, 44L, 27L, 2L, 15L, 13L, 3L, 34L), No = c(171L, 11L,
>> 5L, 8L, 146L, 80L, 5L, 30L, 22L, 5L, 42L), Num = c(189L, 13L,
>> 6L, 10L, 190L, 107L, 7L, 45L, 35L, 8L, 76L)), .Names = c("Case",
>> "Yes", "No", "Num"), class = "data.frame", row.names = c(NA,
>> -11L))
>>
>
> I would probably have turned to logistic regression and used
> p.adjust to deal with the multiple-comparisons issue. By default,
> in a simple model with a factor variable as the predictor, the
> Intercept p-value tests whether the Intercept = 0, which amounts to
> testing the reference level's proportion against 0.50 (since that
> is the proportion at which the log(odds) = 0). The remaining
> coefficients then test whether each other level has the same odds
> as the reference level (equivalent to testing that the odds ratios
> are 1, or that the coefficients are 0). (The first way I set this
> up, your "A" level ended up as the reference, and the hypotheses
> being tested did not seem reasonable or "interesting": namely, that
> the lowest proportion was equal to 50% and that all the other
> proportions were equal to that lowest proportion.)
>
> as.matrix(proportion$Yes/proportion$Num)
> [,1]
> [1,] 0.0952381
> [2,] 0.1538462
> [3,] 0.1666667
> [4,] 0.2000000
> [5,] 0.2315789
> [6,] 0.2523364
> [7,] 0.2857143
> [8,] 0.3333333
> [9,] 0.3714286
> [10,] 0.3750000
> [11,] 0.4473684
>
> So I took your data and made your "K" level the reference value
> since its value (0.44) is the one closest to 50% and then looked at
> the p-values from the others as information about their similarity
> to the K level.
>
> proportion$Case <- relevel(proportion$Case, "K")
> casefit <- glm(cbind(Yes, Num) ~ Case, data= proportion,
> family="binomial")
Should have been:

casefit <- glm(cbind(Yes, No) ~ Case, data = proportion,
               family = "binomial")
> summary(casefit)
> summary(casefit)$coefficients
> p.adjust( summary(casefit)$coefficients[ , 4] )
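[For reference, p.adjust() uses the Holm step-down method by default; a
toy example on made-up p-values shows how the adjustment behaves:]

```r
# Holm adjustment: each sorted p-value is multiplied by the number of
# hypotheses not yet rejected, then a running maximum is taken.
p.adjust(c(0.01, 0.02, 0.03))                   # default method is "holm"
p.adjust(c(0.01, 0.02, 0.03), method = "holm")  # identical result
# both return 0.03 0.04 0.04
```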
>
> That set of hypotheses gives very few "significant" results, but
> most of your groups are pretty sparse so that should be expected.
> Another (and I think better) approach resulting in even fewer
> "significant" results is to use logistic regression with an offset
> term equal to the log(odds) for the entire group:
>
> log( with(proportion, sum(Yes)/sum(No) ) ) # returns [1] -1.181994
> This corresponds to an expected proportion of 0.2346939
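[As a sanity check, back-transforming that offset with the inverse logit
recovers the stated overall proportion; the counts 161 and 525 are
sum(Yes) and sum(No) from the data frame posted above:]

```r
# sum(Yes) = 161 and sum(No) = 525 for the posted data.
overall_logodds <- log(161/525)  # -1.181994, the value used as the offset
plogis(overall_logodds)          # 0.2346939, the overall proportion
161/(161 + 525)                  # the same proportion computed directly
```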
>
> casefit <- glm(cbind(Yes, Num) ~ 0+ Case, data= proportion,
> offset=rep( log( with(proportion, sum(Yes)/
> sum(No) ) ), 11), # each level gets the same offset.
> family="binomial")
This should have been:

casefit <- glm(cbind(Yes, No) ~ 0 + Case, data = proportion,
               offset = rep(log(with(proportion, sum(Yes)/sum(No))), 11),
               family = "binomial")
>
> I used the 0+ on the RHS of the formula to force a labeled estimate
> for each level, since the p-values are now in reference to the
> overall proportion. Since only one level has a "significant" p-value
> and it is highly significant (due to the combination of higher
> numbers of "events" and "distance" from the NULL estimate) you will
> not see much change if you adjust its p-value.
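[A toy illustration of that `0 +` trick, on made-up data rather than the
poster's: dropping the intercept yields one labeled coefficient per
factor level, instead of an intercept plus contrasts against the
reference level.]

```r
# Made-up two-level factor and binary response, just to show the
# coefficient naming; the estimates themselves are not of interest.
f <- factor(c("a", "a", "b", "b"))
y <- c(0, 1, 0, 1)
names(coef(glm(y ~ f, family = "binomial")))      # "(Intercept)" "fb"
names(coef(glm(y ~ 0 + f, family = "binomial")))  # "fa" "fb"
```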
>
New version of the table:

> cbind(round(summary(casefit)$coefficients, 3),
+       Prpns = round((proportion$Yes/proportion$Num)[c(11, 1:10)], 3),
+       N_Yes = proportion$Yes[c(11, 1:10)])
Estimate Std. Error z value Pr(>|z|) Prpns N_Yes
CaseK 0.971 0.231 4.208 0.000 0.447 34
CaseA -1.069 0.248 -4.315 0.000 0.095 18
CaseB -0.523 0.769 -0.680 0.496 0.154 2
CaseC -0.427 1.095 -0.390 0.696 0.167 1
CaseD -0.204 0.791 -0.258 0.796 0.200 2
CaseE -0.017 0.172 -0.101 0.919 0.232 44
CaseF 0.096 0.223 0.430 0.667 0.252 27
CaseG 0.266 0.837 0.318 0.751 0.286 2
CaseH 0.489 0.316 1.546 0.122 0.333 15
CaseI 0.656 0.350 1.875 0.061 0.371 13
CaseJ 0.671 0.730 0.919 0.358 0.375 3
>
> Comparing the counts and the observed proportions to the overall
> proportion (0.2346939) shows few groups (only one, in fact) with
> strong evidence of "being different." I think you need larger
> numbers.
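[For comparison with the chi-squared approach mentioned in the original
question, the overall test of homogeneity can be run directly on the
Yes/No counts; this reuses the `proportion` data frame posted above, and
chisq.test() will warn because several expected cell counts are small:]

```r
# Overall chi-squared test that the Yes/No split is the same across
# cases; it answers only the "do they differ at all?" question, not
# which cases are high or low.
yn <- as.matrix(proportion[, c("Yes", "No")])
chisq.test(yn)
```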
>
With the corrected model, there are now two groups with significant differences.
> p.adjust( summary(casefit)$coefficients[,4] )
CaseK CaseA CaseB CaseC CaseD
0.0002580851 0.0001753949 1.0000000000 1.0000000000 1.0000000000
CaseE CaseF CaseG CaseH CaseI
1.0000000000 1.0000000000 1.0000000000 0.9770895678 0.5472099721
CaseJ
1.0000000000
> --
> David Winsemius, MD
> West Hartford, CT
>
David Winsemius, MD
West Hartford, CT