[R] Sample Size Determination to Compare Three Independent Proportions
Marc Schwartz
marc_schwartz at me.com
Tue Aug 10 15:28:52 CEST 2021
Hi,
A search would suggest that there may not be an R function/package that
provides power/sample size calculations for the specific scenarios that
you are describing. There may be something that I am missing, and there
is also other dedicated software such as PASS
(https://www.ncss.com/software/pass/) which is not free, but provides a
large library of possibly relevant functions and support.
That being said, you can run Monte Carlo simulations in R to achieve the
results you want, while giving yourself options with respect to study
design, intended tests, and adjustments for multiple comparisons as
apropos. Many prefer this approach, since it gives you explicit control
over the process.
Taking the simple case, where you are going to run a 3 x 2 chi-square as
your primary endpoint, and want to power for that, here is a possible
function, with the same sample size in each group:
ThreeGroups <- function(n, p1, p2, p3, R = 10000, power = 0.8) {
  MCSim <- function(n, p1, p2, p3) {
    ## Generate a binary sample for each group
    G1 <- rbinom(n, 1, p1)
    G2 <- rbinom(n, 1, p2)
    G3 <- rbinom(n, 1, p3)
    ## Combine the three sets of counts into a single table of
    ## group by outcome counts (explicit factor levels keep both
    ## outcome counts present even if a simulated sample happens
    ## to be all 0s or all 1s)
    MAT <- cbind(table(factor(G1, levels = 0:1)),
                 table(factor(G2, levels = 0:1)),
                 table(factor(G3, levels = 0:1)))
    ## Perform a chi-square test and return just the p value
    chisq.test(MAT)$p.value
  }
  ## Replicate the above R times to get a distribution of p values
  MC <- replicate(R, MCSim(n, p1, p2, p3))
  ## Return the p value at the desired "power" quantile
  quantile(MC, power)
}
Essentially, the above internal MCSim() function generates 3 random
samples of size 'n' from the binomial distribution, at the 3 proportions
desired. For each run, it will perform a chi-square test of the 3 x 2
matrix of counts, returning the p value for each run. The main function
will then return the p value at the quantile (power) within the
generated distribution of p values.
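To make the connection between that quantile and power explicit: the 80%
quantile of the p values being at or below 0.05 means that at least 80%
of the simulated runs were significant at alpha = 0.05, which is the
usual Monte Carlo estimate of power. As a small illustration (using a
hypothetical 'MC' vector, not part of the function above):

## Hypothetical: suppose 'MC' holds the vector of simulated p values
## generated inside ThreeGroups() for a given 'n'. These two checks ask
## essentially the same question: is power at least 0.80 at alpha = 0.05?
quantile(MC, 0.80) <= 0.05   # p value at the 80th percentile is within alpha
mean(MC <= 0.05) >= 0.80     # at least 80% of the simulated runs are significant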
You can look at the help pages for the various functions that I use
above, to get a sense for how they work.
You increase the sample size ('n') until the returned p value is <=
0.05, if that is your desired alpha level.
You also want 'R', the number of replications within each run, to be
large enough so that the returned p value quantile is relatively stable.
Once you get "close to" the desired p value, values for 'R' should be on
the order of 1,000,000 or higher. Stay with lower values for 'R' until
you get in the ballpark of your target, since larger values take much
longer to run.
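If you would rather automate that search than adjust 'n' by hand, a
minimal sketch along the following lines is possible. This wrapper, its
step size 'by', and the starting value are my own illustrative choices,
assuming the ThreeGroups() function defined above:

## Step the per-group sample size upward until the p value at the
## 'power' quantile drops to the target alpha
FindN <- function(start, p1, p2, p3, by = 10, alpha = 0.05,
                  R = 10000, power = 0.8) {
  n <- start
  repeat {
    p <- ThreeGroups(n, p1, p2, p3, R = R, power = power)
    if (p <= alpha) break
    n <- n + by
  }
  c(n = n, p = unname(p))
}

## e.g. a coarse first pass with a modest 'R':
## FindN(250, 0.25, 0.25, 0.35, by = 10, R = 10000)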
Thus, using your example proportions of 0.25, 0.25, and 0.35:
## 250 per group, 750 total - Not enough
> ThreeGroups(250, 0.25, 0.25, 0.35, R = 10000)
80%
0.08884723
## 350 per group, 1050 total - Too high
> ThreeGroups(350, 0.25, 0.25, 0.35, R = 10000)
80%
0.0270829
## 300 per group, 900 total - Close!
> ThreeGroups(300, 0.25, 0.25, 0.35, R = 10000)
80%
0.04818842
So, keep tweaking the sample size until the returned p value is at your
target alpha level, using a large enough 'R' that you get consistent
sample sizes across multiple runs.
If I run 300 per group again, with 10,000 replicates:
> ThreeGroups(300, 0.25, 0.25, 0.35, R = 10000)
80%
0.05033933
the returned p value is slightly higher. So, again, increase 'R' to
improve the stability of the returned p value, and run it multiple times
to be comfortable that the run-to-run variation in the p value is within
an acceptable threshold.
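One simple way to gauge that stability is to run the function several
times at the candidate 'n' and look at the spread of the returned
quantiles, for example (my own illustrative check; the value of 'R' here
is arbitrary):

## Run the simulation several times at the candidate sample size and
## inspect how much the returned 80% quantile moves between runs
reps <- replicate(5, ThreeGroups(300, 0.25, 0.25, 0.35, R = 100000))
round(reps, 4)
range(reps)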
Now, the tricky part is to decide whether the 3 x 2 test is your primary
endpoint and you want to power only for that, or whether you also want
to power for the pairwise two-group comparisons, possibly having to
account for p value adjustments for the multiple comparisons, resulting
in the need to power for a lower alpha level for those tests. In that
scenario, you would end up taking the largest sample size that you
identify across the various hypotheses, recognizing that while you are
powering for one hypothesis, you may be overpowering for others.
That is something that you need to decide, perhaps in consultation with
local statistical expertise as may be apropos for the prospective study
design, and possibly informed by other relevant/similar research in your
domain.
You can easily modify the above function for the two-group scenario as
well; one possible starting point is sketched below, but I will leave
the details to you.
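For what it is worth, here is that sketch. The choice of chisq.test()
and a Bonferroni-adjusted target alpha of 0.05 / 3 for three pairwise
tests are illustrative assumptions only; substitute whatever test and
adjustment your analysis plan actually calls for:

## A possible two-group analogue of ThreeGroups(); compare the returned
## quantile against the adjusted alpha rather than 0.05 itself
TwoGroups <- function(n, p1, p2, R = 10000, power = 0.8) {
  MCSim <- function(n, p1, p2) {
    G1 <- rbinom(n, 1, p1)
    G2 <- rbinom(n, 1, p2)
    ## 2 x 2 table of counts; note that chisq.test() applies Yates'
    ## continuity correction to 2 x 2 tables by default
    ## (pass correct = FALSE if that is not wanted)
    MAT <- cbind(table(factor(G1, levels = 0:1)),
                 table(factor(G2, levels = 0:1)))
    chisq.test(MAT)$p.value
  }
  MC <- replicate(R, MCSim(n, p1, p2))
  quantile(MC, power)
}

## e.g. powering the 1 versus 3 comparison (0.25 vs 0.35), comparing the
## returned quantile to a Bonferroni-adjusted alpha of 0.05 / 3:
## TwoGroups(450, 0.25, 0.35, R = 10000)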
Regards,
Marc
AbouEl-Makarim Aboueissa wrote on 8/10/21 6:34 AM:
> Hi Marc:
>
> First, thank you very much for your help in this matter.
>
>
> Will perform an initial omnibus test of all three groups (e.g. 3 x 2
> chi-square), possibly followed by
> all possible 2 x 2 pairwise comparisons (e.g. 1 versus 2, 1 versus 3,
> 2 versus 3),
>
> We can assume _either_ the desired sample size in each group is the same
> _or_ proportional to the population size.
>
> We can set p=0.25 and set p1=p2=p3=p so that the H0 is true.
>
> We can assume that the expected proportion of "Yes" values in each group
> is 0.25
>
> For the alternative hypotheses, for example, we can set p1 = .25,
> p2=.25, p3=.35
>
>
> Again thank you very much in advance.
>
> abou
>
> ______________________
>
> *AbouEl-Makarim Aboueissa, PhD*
>
> *Professor, Statistics and Data Science*
> *Graduate Coordinator*
> *Department of Mathematics and Statistics*
> *University of Southern Maine*
>
>
>
> On Mon, Aug 9, 2021 at 10:53 AM Marc Schwartz <marc_schwartz at me.com> wrote:
>
> Hi,
>
> You are going to need to provide more information than what you have
> below and I may be mis-interpreting what you have provided.
>
> Presuming you are designing a prospective, three-group, randomized
> allocation study, there is typically an a priori specification of the
> ratios of the sample sizes for each group such as 1:1:1, indicating that
> the desired sample size in each group is the same.
>
> You would also need to specify the expected proportions of "Yes" values
> in each group.
>
> Further, you need to specify how you are going to compare the
> proportions in each group. Are you going to perform an initial omnibus
> test of all three groups (e.g. 3 x 2 chi-square), possibly followed by
> all possible 2 x 2 pairwise comparisons (e.g. 1 versus 2, 1 versus 3, 2
> versus 3), or are you just going to compare 2 versus 1, and 3 versus 1,
> where 1 is a control group?
>
> Depending upon your testing plan, you may also need to account for p
> value adjustments for multiple comparisons, in which case, you also need
> to specify what adjustment method you plan to use, to know what the
> target alpha level will be.
>
> On the other hand, if you already have the data collected, thus have
> fixed sample sizes available per your wording below, simply go ahead and
> perform your planned analyses, as the notion of "power" is largely an a
> priori consideration, which reflects the probability of finding a
> "statistically significant" result at a given alpha level, given that
> your a priori assumptions are valid.
>
> Regards,
>
> Marc Schwartz
>
>
> AbouEl-Makarim Aboueissa wrote on 8/9/21 9:41 AM:
> > Dear All: good morning
> >
> > *Re:* Sample Size Determination to Compare Three Independent Proportions
> >
> > *Situation:*
> >
> > Three Binary variables (Yes, No)
> >
> > Three independent populations with fixed sizes (*say:* N1 = 1500, N2 = 900, N3 = 1350).
> >
> > Power = 0.80
> >
> > How to choose the sample sizes to compare the three proportions of “Yes”
> > among the three variables.
> >
> > If you know a reference to this topic, it will be very helpful too.
> >
> > with many thanks in advance
> >
> > abou
> > ______________________
> >
> >
> > *AbouEl-Makarim Aboueissa, PhD*
> >
> > *Professor, Statistics and Data Science*
> > *Graduate Coordinator*
> >
> > *Department of Mathematics and Statistics*
> > *University of Southern Maine*
> >
>