Proportion Inference

David Gerbing

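The analysis of proportions is of two primary types: tests that focus on a designated value of a categorical variable, called a success, in one or more samples, and tests without a designated value, which evaluate goodness-of-fit for a single variable or the independence of two variables.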

Building on standard base R functions, the lessR function Prop_test(), abbreviated prop(), provides either type of analysis. Enter either the original data, from which the function computes the frequencies and then the sample proportions, or the already computed frequencies. For the analysis of multiple categorical variables across two levels of one of the variables, the test of homogeneity and the test of independence yield the identical statistical result.
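As a rough check of that equivalence with base R alone, the sketch below runs prop.test() (homogeneity) and chisq.test() (independence) on the same counts. The counts are the Bike by Jacket frequencies reported later in this vignette. The object names are illustrative only and are not part of lessR.

```r
# BMW riders (successes) and all riders (trials) for each jacket type,
# taken from the Bike by Jacket table later in this vignette
bmw   <- c(Lite = 89,  Med = 135, Thick = 194)
total <- c(Lite = 372, Med = 342, Thick = 311)

# Homogeneity: same proportion of BMW riders across the three jacket groups
homog <- prop.test(bmw, total)

# Independence: the same counts arranged as a 2 x 3 contingency table
tab   <- rbind(BMW = bmw, Honda = total - bmw)
indep <- chisq.test(tab)

# Compare the chi-square statistics and p-values from the two tests
c(homogeneity = unname(homog$statistic), independence = unname(indep$statistic))
all.equal(unname(homog$statistic), unname(indep$statistic))
all.equal(homog$p.value, indep$p.value)
```

With more than two groups neither function applies a continuity correction, so the two statistics are computed from the same observed and expected frequencies.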

The following table summarizes the values of the Prop_test() parameters for different analyses of proportions. Each function call for the analysis of data begins with the name of a categorical variable, generically referred to as X. The value of X is the first parameter in the function definition, and so does not need its parameter name, variable. If needed, indicate a second categorical variable, generically referred to as Y, with the by parameter. If focused on a specific value of X as a success, referred to as X_value, indicate that value with the success parameter.

Run each analysis either directly from pre-computed values of the sample proportions, or from the original data from which the sample proportions are calculated.

Evaluate                          Data Parameters            Count Parameters
--------------------------------  -------------------------  ------------------------
A hypothesized proportion         X, success=X_value         n_succ, n_tot [scalars]
Equal proportions across samples  X, success=X_value, by=Y   n_succ, n_tot [vectors]
Uniform goodness-of-fit           X                          n_tot [vector]
Independence of two variables     X, by=Y                    n_table

The remainder of this vignette illustrates these applications of Prop_test().

Test a Specified Proportion

Define the occurrence of a designated value of the variable as a success. Define all other values of the variable as failures. Of course, success or failure in this context does not necessarily mean good or bad, desired or undesired, but instead, a designated value either occurred or did not.

When analyzing proportions from data, first indicate the categorical variable, the value of the parameter variable. Next, indicate the designated value of variable with the parameter success. When entering proportions directly, indicate the number of successes and the total number of trials with the n_succ and n_tot parameters. Enter the value of each parameter either as a single value for one sample or as a vector of multiple values for multiple samples. Without a value for success or n_succ the analysis is of goodness-of-fit or independence.

Single Proportion

The example below is from the documentation for the base R function binom.test(), which provides the exact test of a null hypothesis regarding the probability of success. Prop_test() uses that base R function to compare a sample proportion to a hypothesized population value.

From Input Frequencies

For a given categorical variable of interest, a type of plant, consider two values, “giant” and “dwarf”. From a sample of 925 plants, the specified value of “giant” occurred 682 times and did not occur 243 times. The null hypothesis, specified with the pi parameter, is that the specified value occurs for 3/4 of the population. In this example, enter the counts as the number of successes, n_succ, and the number of failures, n_fail.

Prop_test(n_succ=682, n_fail=243, pi=.75)
## 
## <<< Exact binomial test of a proportion 
## 
## ------ Describe ------
## 
## Number of successes: 682 
## Number of failures: 243 
## Number of trials: 925 
## Sample proportion: 0.737 
## 
## ------ Infer ------
## 
## Hypothesis test for null of 0.75, p-value: 0.382
## 95% Confidence interval: 0.708 to 0.765
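Because Prop_test() relies on binom.test() for this analysis, the same numbers can be checked directly in base R. The following is a minimal sketch, with fit as an illustrative object name.

```r
# Exact binomial test from counts: 682 successes in 925 trials, null value 0.75
fit <- binom.test(x = 682, n = 925, p = 0.75)

fit$estimate   # sample proportion, 682/925
fit$p.value    # two-sided exact p-value
fit$conf.int   # 95% confidence interval for the population proportion
```

Rounded to three decimal digits, these values should agree with the Prop_test() output above.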

From Data

To illustrate with data, read the Jackets data file included with lessR into the data frame d. The file contains two categorical variables. The variable Bike represents two different types of motorcycle: BMW and Honda. The second variable is Jacket with three values of jacket thickness: Lite, Med, and Thick. Because d is the default name of the data frame that contains the variables for analysis, the data parameter that names the input data frame need not be specified.

d <- Read("Jackets")
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1      Bike character   1025       0       2   BMW  Honda  Honda ... Honda  Honda  BMW
##  2    Jacket character   1025       0       3   Lite  Lite  Lite ... Lite  Med  Lite
## ------------------------------------------------------------------------------------------

In the following example, for the variable Bike from the default d data frame, define the parameter success as the value “BMW”. The default null hypothesis is a population value of 0.5, but here specify that value explicitly with the parameter pi.

For clarity, the following example includes the parameter names listed with their corresponding values. The names variable and success are unnecessary in this example, however, because their values are listed in the same order as in the definition of the Prop_test() function. Retain the name pi, which is not the third parameter in that definition.

Prop_test(variable=Bike, success="BMW", pi=0.5)
## 
## <<< Exact binomial test of a proportion 
## 
## variable: Bike 
## success: BMW 
## 
## ------ Describe ------
## 
## Number of missing values: 0 
## Number of successes: 418 
## Number of failures: 607 
## Number of trials: 1025 
## Sample proportion: 0.408 
## 
## ------ Infer ------
## 
## Hypothesis test for null of 0.5, p-value: 0.000
## 95% Confidence interval: 0.378 to 0.439

Reject the null hypothesis: the \(p\)-value, 0.000 to three decimal digits, is less than \(\alpha = 0.05\). The sample proportion \(p=0.408\) is far from the hypothesized value of \(0.5\) for the proportion of “BMW” values for Bike. Conclude that the data were sampled from a population with a proportion of BMW different from 0.5.
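To make the counting step and the decision rule explicit, the following base R sketch rebuilds a raw vector with the counts reported above, counts the occurrences of the designated value, and applies the exact test. The vector bike and the other object names are illustrative assumptions, not the Jackets data file itself.

```r
# Raw values with the counts reported above: 418 BMW and 607 Honda
bike <- c(rep("BMW", 418), rep("Honda", 607))

n_succ <- sum(bike == "BMW")   # occurrences of the designated value
n_tot  <- length(bike)         # total number of trials

# Exact binomial test of the null hypothesis that the proportion is 0.5
fit <- binom.test(n_succ, n_tot, p = 0.5)

alpha  <- 0.05
reject <- fit$p.value < alpha  # TRUE means reject the null at the 5% level

fit$estimate
fit$conf.int
reject
```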

Multiple Proportions

The following example is from the base R prop.test() documentation, which the lessR Prop_test() relies upon to compare proportions across different groups.

From Input Frequencies

The null hypothesis in this example is that the four populations of patients from which the samples were drawn have the same population proportion of smokers. The alternative is that at least one population proportion is different.

To indicate multiple proportions across groups, provide multiple values for the n_succ and n_tot parameters. Optionally, label the groups in the output by providing a named vector for the successes.

smokers <- c(83, 90, 129, 70)
names(smokers) <- c("Group1","Group2","Group3","Group4")
patients <- c(86, 93, 136, 82)
Prop_test(n_succ=smokers, n_tot=patients)
## 
## <<< 4-sample test for equality of proportions without continuity correction 
## 
## 
## --- Description
## 
##               Group1   Group2   Group3   Group4
## -----------  -------  -------  -------  -------
## n_                83       90      129       70
## n_total           86       93      136       82
## proportion     0.965    0.968    0.949    0.854
## 
## --- Inference
## 
## Chi-square statistic: 12.600 
## Degrees of freedom: 3 
## Hypothesis test of equal population proportions: p-value = 0.006

The result of the test is \(p\)-value \(=0.006 < \alpha=0.05\), so reject the null hypothesis of equal proportions across the corresponding four populations. Conclude that at least one of the population proportions of smokers differs from the others.
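The same test can be run with the base R prop.test() function on which Prop_test() relies. This is a minimal sketch, with fit as an illustrative object name.

```r
# Number of smokers and number of patients in each of the four groups
smokers  <- c(Group1 = 83, Group2 = 90, Group3 = 129, Group4 = 70)
patients <- c(86, 93, 136, 82)

# Test of equal population proportions across the four groups
fit <- prop.test(x = smokers, n = patients)

round(fit$estimate, 3)   # sample proportion of smokers in each group
fit$statistic            # chi-square statistic
fit$parameter            # degrees of freedom
fit$p.value
```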

From Data

The following example duplicates the previous results, but from data. To illustrate, create the data frame d with the same frequencies of smokers and non-smokers in each group, with respective values “smoke” and “nosmoke”. Of course, in actual data analysis the data would already be available.

sm1 <- c(rep("smoke", 83), rep("nosmoke", 3))
sm2 <- c(rep("smoke", 90), rep("nosmoke", 3))
sm3 <- c(rep("smoke", 129), rep("nosmoke", 7))
sm4 <- c(rep("smoke", 70), rep("nosmoke", 12))
sm <- c(sm1, sm2, sm3, sm4)
grp <- c(rep("A",86), rep("B",93), rep("C",136), rep("D",82))
d <- data.frame(sm, grp)

To test if the different groups have the same population proportion of success, retain the syntax for a single proportion for the categorical variable of interest. Define success by the value of this variable, here “smoke”. However, an additional parameter by indicates the variable that defines the groups, a variable that contains a label that identifies the corresponding group for each row of data. The grouping variable in this example is grp, with values the first four uppercase letters of the alphabet. The first six rows of data are shown below.

head(d)
##      sm grp
## 1 smoke   A
## 2 smoke   A
## 3 smoke   A
## 4 smoke   A
## 5 smoke   A
## 6 smoke   A

The relevant parameters variable, success, and by are listed in their given order in this example, so the parameter names are unnecessary. The names are included here for clarity.

Prop_test(variable=sm, success="smoke", by=grp)
## 
## <<< 4-sample test for equality of proportions without continuity correction 
## 
## variable: sm 
## success: smoke 
## by: grp 
## 
## --- Description
## 
##                   A       B       C       D
## -----------  ------  ------  ------  ------
## n_smoke          83      90     129      70
## n_total          86      93     136      82
## proportion    0.965   0.968   0.949   0.854
## 
## --- Inference
## 
## Chi-square statistic: 12.600 
## Degrees of freedom: 3 
## Hypothesis test of equal population proportions: p-value = 0.006

The analysis of data that matches the previously input proportions, of course, provides the same results as providing the proportions directly.
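To see how raw data reduce to the frequencies entered earlier, the following base R sketch tabulates the same values by group and passes the resulting counts to prop.test(). The name d2 and the other object names are illustrative, chosen to avoid overwriting d.

```r
# Same raw values as above, built in one step
sm  <- c(rep("smoke", 83),  rep("nosmoke", 3),
         rep("smoke", 90),  rep("nosmoke", 3),
         rep("smoke", 129), rep("nosmoke", 7),
         rep("smoke", 70),  rep("nosmoke", 12))
grp <- c(rep("A", 86), rep("B", 93), rep("C", 136), rep("D", 82))
d2  <- data.frame(sm, grp)

# Cross-tabulate: rows are the groups, columns are nosmoke and smoke
tab <- table(d2$grp, d2$sm)

# Successes and totals for each group as plain vectors
n_succ <- as.vector(tab[, "smoke"])
names(n_succ) <- rownames(tab)
n_tot  <- as.vector(rowSums(tab))

fit <- prop.test(n_succ, n_tot)
fit$statistic
fit$p.value
```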

Tests without a Specified Proportion

Goodness-of-Fit

For the previously discussed test of homogeneity of the values of a single categorical variable, the proportion of occurrences for a specific value across different samples is of interest. Here, instead calculate the proportion of occurrence for each value from the total number of occurrences, as one sample from a single population. In addition to the inference test, the following are also reported:

- The observed and expected frequencies
- The residual of expected from observed
- The standardized version of the residual

From Input Frequencies

For the goodness-of-fit test to a uniform distribution, provide the frequencies for each group for the parameter n_tot. The default null hypothesis is that the proportions of the different categories of a categorical variable are equal.

In this example, enter three frequencies as a vector for the n_tot parameter value. Optionally, make the vector a named vector to label the output accordingly.

x <- c(372, 342, 311)
names(x) <- c("Lite", "Med", "Thick")
Prop_test(n_tot=x)
## 
## <<< Chi-squared test for given probabilities 
## 
## 
## --- Description
## 
##                Lite       Med     Thick
## ---------  --------  --------  --------
## observed        372       342       311
## expected    341.667   341.667   341.667
## residual      1.641     0.018    -1.659
## stdn res      2.010     0.022    -2.032
## 
## --- Inference
## 
## Chi-square statistic: 5.446 
## Degrees of freedom: 2 
## Hypothesis test of equal population proportions: p-value = 0.066

This example does not quite attain significance at the customary 5% level, with \(p\)-value \(= 0.066 > \alpha = 0.05\). A difference of the corresponding population proportions was not detected.
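The reported values can be reproduced by hand. The following base R sketch computes the expected frequencies under the uniform null hypothesis, the residuals as they appear in the output above (observed minus expected, divided by the square root of expected), and the chi-square statistic, then compares the result with chisq.test(). The object names are illustrative.

```r
observed <- c(Lite = 372, Med = 342, Thick = 311)
k        <- length(observed)

# Uniform null hypothesis: the same expected frequency for every category
expected <- rep(sum(observed) / k, k)

# Residual for each category, and their sum of squares as the test statistic
residual <- (observed - expected) / sqrt(expected)
chi_sq   <- sum(residual^2)
df       <- k - 1
p_value  <- pchisq(chi_sq, df, lower.tail = FALSE)

# Base R equivalent, with equal probabilities as the default
fit <- chisq.test(observed)

round(residual, 3)
round(fit$stdres, 3)   # standardized residuals
c(chi_sq = chi_sq, df = df, p_value = p_value)
```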

From Data

The same analysis follows from the data. Just specify the name of the categorical variable of interest.

d <- Read("Jackets", quiet=TRUE)
Prop_test(Jacket)
## 
## <<< Chi-squared test for given probabilities 
## 
## variable: Jacket 
## 
## --- Description
## 
##                Lite       Med     Thick
## ---------  --------  --------  --------
## observed        372       342       311
## expected    341.667   341.667   341.667
## residual      1.641     0.018    -1.659
## stdn res      2.010     0.022    -2.032
## 
## --- Inference
## 
## Chi-square statistic: 5.446 
## Degrees of freedom: 2 
## Hypothesis test of equal population proportions: p-value = 0.066

Independence

Tests of independence evaluated here rely upon a contingency table of two dimensions, also called a cross-tabulation table or joint frequency table. Enter the joint frequencies directly or compute them from the data. The corresponding analysis provides the chi-square test for the null hypothesis of independence.

Also provided is Cramer’s V to indicate the extent of the relationship of the two categorical variables. For each cell frequency, the expected value given the independence assumption is provided, along with the corresponding residual from the observed frequency and the corresponding standardized residual.

From Input Frequencies

To enter the joint frequency table directly, store the frequencies in a file accessible from your computer system. One possibility is to enter the numbers into a text file with file type .csv or .txt. Enter the numbers with a text editor, or with a word processor, saving the file as a text file. The csv format separates the adjacent values in each row with a comma, as indicated below. Or, enter the numbers into an MS Excel formatted file with file type .xlsx. Enter only the numeric frequencies, no labels.

For example, consider the following joint frequency table with four levels of the column variable and four levels of the row variable, here in csv format.

3,58,6,105
41,79,9,207
86,179,27,484
143,214,31,824

After saving the file, call Prop_test() using the parameter n_table to indicate the path name to the file, enclosed in quotes. Or, leave the quotes empty to browse for the joint frequency table.

This table is included in a file downloaded with lessR with the name FreqTable99. That name triggers an internal process that locates the file within the lessR installation without needing to construct a rather complicated path name as part of this example. That also means that the name is a reserved keyword, and its use always triggers the following example.

In general, replace FreqTable99 in this example with your own path name to your file of joint frequencies, or just delete the name leaving only the two quotes to indicate to browse for the file.

Prop_test(n_table="FreqTable99")
## 
## <<< Pearson's Chi-squared test 
## 
## --- Description
## 
## Cell Frequencies               
##    3  58  6 105
##   41  79  9 207
##   86 179 27 484
##  143 214 31 824
## 
##  Cramer's V: 0.075 
## 
##  Row Col Observed Expected Residual Stnd Res
##    1   1        3   18.812  -15.812   -4.003
##    1   2       58   36.522   21.478    4.150
##    1   3        6    5.030    0.970    0.455
##    1   4      105  111.635   -6.635   -1.098
##    2   1       41   36.750    4.250    0.799
##    2   2       79   71.346    7.654    1.098
##    2   3        9    9.827   -0.827   -0.288
##    2   4      207  218.077  -11.077   -1.361
##    3   1       86   84.875    1.125    0.156
##    3   2      179  164.776   14.224    1.504
##    3   3       27   22.696    4.304    1.105
##    3   4      484  503.654  -19.654   -1.781
##    4   1      143  132.562   10.438    1.339
##    4   2      214  257.356  -43.356   -4.246
##    4   3       31   35.447   -4.447   -1.057
##    4   4      824  786.635   37.365    3.135
## 
## --- Inference
## 
## Chi-square statistic: 41.732 
## Degrees of freedom: 9 
## Hypothesis test of equal population proportions: p-value = 0.000
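For comparison, the following base R sketch writes the same frequencies to a temporary csv file, reads them back as a matrix, and runs chisq.test(). Cramer's V is computed here with the usual formula, the square root of the chi-square statistic divided by the product of the sample size and one less than the smaller table dimension. That formula is an assumption about how the reported value is obtained. The temporary path and object names are illustrative.

```r
# Write the joint frequencies to a temporary csv file, numbers only
path <- tempfile(fileext = ".csv")
writeLines(c("3,58,6,105",
             "41,79,9,207",
             "86,179,27,484",
             "143,214,31,824"), path)

# Read the frequencies without a header row and convert to a matrix
freq <- as.matrix(read.csv(path, header = FALSE))

# Pearson chi-square test of independence
fit <- chisq.test(freq)

# Cramer's V from the chi-square statistic
n <- sum(freq)
V <- sqrt(unname(fit$statistic) / (n * (min(dim(freq)) - 1)))

fit$statistic
fit$parameter
round(V, 3)
```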

Do not have the path name to your file readily available? Then browse for the file. The following example is not run as it cannot run in this vignette.

Prop_test(n_table="")

The full path name for the file is provided as part of the output.

From Data

The \(\chi^2\) test of independence evaluated here applies to two categorical variables. The first categorical variable listed in this example is the value of the parameter variable, the first parameter in the function definition, so does not need the parameter name. The second categorical variable listed must include the parameter name by.

The question for the analysis is whether the observed frequencies of Jacket thickness and Bike ownership differ sufficiently from the frequencies expected under the null hypothesis of independence to conclude that the variables are related.

Prop_test(Jacket, by=Bike)
## variable: Jacket 
## by: Bike 
## 
## <<< Pearson's Chi-squared test 
## 
## --- Description
## 
##        Jacket
## Bike    Lite  Med Thick  Sum
##   BMW     89  135   194  418
##   Honda  283  207   117  607
##   Sum    372  342   311 1025
## 
##  Cramer's V: 0.319 
## 
##  Row Col Observed Expected Residual Stnd Res
##    1   1       89  151.703  -62.703   -8.288
##    1   2      135  139.469   -4.469   -0.602
##    1   3      194  126.827   67.173    9.287
##    2   1      283  220.297   62.703    8.288
##    2   2      207  202.531    4.469    0.602
##    2   3      117  184.173  -67.173   -9.287
## 
## --- Inference
## 
## Chi-square statistic: 104.083 
## Degrees of freedom: 2 
## Hypothesis test of equal population proportions: p-value = 0.000

The result of this test is that the \(p\)-value = 0.000 \(< \alpha=0.05\), so reject the null hypothesis of independence. Conclude that the type of Bike a person rides and the thickness of their Jacket are related.
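To show where the expected frequencies and residuals in this output come from, the following base R sketch computes them from the marginal totals of the table above and compares them with chisq.test(). The table is entered by hand from the reported frequencies. The object names and the Cramer's V formula are illustrative assumptions.

```r
# Joint frequencies from the output above
tab <- matrix(c( 89, 135, 194,
                283, 207, 117),
              nrow = 2, byrow = TRUE,
              dimnames = list(Bike   = c("BMW", "Honda"),
                              Jacket = c("Lite", "Med", "Thick")))

n <- sum(tab)

# Expected frequency under independence: row total * column total / n
expected <- outer(rowSums(tab), colSums(tab)) / n
residual <- tab - expected   # observed minus expected

fit <- chisq.test(tab)

# Cramer's V from the chi-square statistic
V <- sqrt(unname(fit$statistic) / (n * (min(dim(tab)) - 1)))

round(expected, 3)
round(residual, 3)
round(fit$stdres, 3)   # standardized residuals
round(V, 3)
```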

To visualize the relationship of the two variables, use the same function call syntax, but now with BarChart() instead of Prop_test(). The visualization is accompanied by the same \(\chi^2\) test of independence.

BarChart(Jacket, by=Bike)

## >>> Suggestions
## Plot(Jacket, Bike)  # bubble plot
## BarChart(Jacket, by=Bike, horiz=TRUE)  # horizontal bar chart
## BarChart(Jacket, fill="steelblue")  # steelblue bars 
## 
## Joint and Marginal Frequencies 
## ------------------------------ 
##  
##     Jacket 
## Bike    Lite Med Thick  Sum 
##   BMW     89 135   194  418 
##   Honda  283 207   117  607 
##   Sum    372 342   311 1025 
## 
## Cramer's V: 0.319 
##  
## Chi-square Test of Independence:
##      Chisq = 104.083, df = 2, p-value = 0.000

The visualization depicts the relationship between motorcycle and jacket: Honda riders prefer thinner jackets, and BMW riders prefer thicker jackets. To speculate, perhaps because the BMW bikes are sportier, their riders are more concerned with going down on the pavement.

This relationship becomes even clearer with the corresponding 100% stacked bar graph. Each bar in this visualization represents a jacket choice and shows the percentage of riders with each type of motorcycle for that jacket.

BarChart(Jacket, by=Bike, stack100=TRUE)

## >>> Suggestions
## Plot(Jacket, Bike)  # bubble plot
## BarChart(Jacket, by=Bike, horiz=TRUE)  # horizontal bar chart
## BarChart(Jacket, fill="steelblue")  # steelblue bars 
## 
## Joint and Marginal Frequencies 
## ------------------------------ 
##  
##     Jacket 
## Bike    Lite Med Thick  Sum 
##   BMW     89 135   194  418 
##   Honda  283 207   117  607 
##   Sum    372 342   311 1025 
## 
## Cramer's V: 0.319 
##  
## Chi-square Test of Independence:
##      Chisq = 104.083, df = 2, p-value = 0.000 
## 
## Cell Proportions within Each Column 
## ----------------------------------- 
##  
##     Jacket 
## Bike       Lite     Med   Thick 
##   BMW     0.239   0.395   0.624 
##   Honda   0.761   0.605   0.376 
##   Sum     1.000   1.000   1.000

From this visualization we see that 24% of Lite jacket owners are BMW riders, and, in contrast, 62% of the owners of Thick jackets are BMW riders.
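The column proportions behind those percentages follow directly from the joint frequencies. A minimal base R sketch, with the table entered by hand from the output above and illustrative object names:

```r
# Joint frequencies of Bike by Jacket from the output above
tab <- matrix(c( 89, 135, 194,
                283, 207, 117),
              nrow = 2, byrow = TRUE,
              dimnames = list(Bike   = c("BMW", "Honda"),
                              Jacket = c("Lite", "Med", "Thick")))

# Proportion of each Bike type within each Jacket column
col_prop <- prop.table(tab, margin = 2)

round(col_prop, 3)
colSums(col_prop)   # each column sums to 1
```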