[R] Setting up hypothesis tests with the infer library?

David Winsemius dwinsemius at comcast.net
Sun Mar 30 05:08:10 CEST 2025



> On Mar 29, 2025, at 9:59 AM, Kevin Zembower via R-help <r-help at r-project.org> wrote:
> 
> Hi, Rui and Michael, thank you both for replying.
> 
> Yeah, I'm not supposed to know about Chi-squared yet. So far, all of
> our work with hypothesis tests has involved creating the sample data,
> then resampling it to create a null distribution, and finally computing
> p-values.
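The workflow you describe (build the sample, shuffle it under the null, compute a p-value) can be done in a few lines of base R. A minimal sketch, with my own variable names and the two-proportion numbers from your example further down:

set.seed(2025)
yes <- c(660, 640)                 # "Y" counts in 1980 and 2010
n   <- c(1000, 1000)

# reconstruct the raw responses and group labels
resp  <- rep(rep(c(1, 0), 2),
             c(yes[1], n[1] - yes[1], yes[2], n[2] - yes[2]))
group <- rep(c("1980", "2010"), n)

# observed difference in proportions
obs <- mean(resp[group == "1980"]) - mean(resp[group == "2010"])

# permute the responses to simulate the null distribution
null_dist <- replicate(10000, {
    s <- sample(resp)
    mean(s[group == "1980"]) - mean(s[group == "2010"])
})

mean(null_dist >= obs)             # one-sided p-value for p1 > p2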

You might want to look at the "resample" package; it has been around for a while. One of its authors is Tim Hesterberg, a well-respected member of the R community. I think it was written as support for the original edition of Chihara and Hesterberg, "Mathematical Statistics with Resampling and R" (now in its 3rd edition, 2022).
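On your question about avoiding the full 2000-row data frame: as far as I know, infer's specify() does want the raw observations, but for the classical test prop.test() accepts the summary counts directly:

prop.test(x = c(660, 640), n = c(1000, 1000), alternative = "greater")

(alternative = "greater" matches the one-sided question of whether the 1980 proportion was higher.)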

-- 
David.

> 
> prop.test() would obviously work. I'll look into that; I didn't
> know about it.
> 
> I'm really struck by the changes in statistics teaching methods,
> compared to my first exposure to statistics more than 20 years ago. I
> can't ever remember doing a simulation then, probably due to the lack
> of computing resources. It was all calculations based on the Normal
> curve. Now, we haven't even been introduced to any calculations
> involving the Normal curve, and won't be for another two chapters. It's
> the last chapter we study in this one-semester course. In this course,
> it's all been simulations, bootstraps, randomized distributions, etc.
> 
> Thank you, again, Rui and Michael, for your help.
> 
> -Kevin
> 
> On Sat, 2025-03-29 at 16:42 +0000, Rui Barradas wrote:
>> Às 16:09 de 29/03/2025, Kevin Zembower via R-help escreveu:
>>> Hello, all,
>>> 
>>> We're now starting to cover hypothesis tests in my Stats 101
>>> course. As
>>> usual in courses using the Lock5 textbook, 3rd ed., the homework
>>> answers are calculated using their StatKey application. In addition
>>> (and for no extra credit), I'm trying to solve the problems using
>>> R. In
>>> the case of hypothesis test, in addition to manually setting up
>>> randomized null hypothesis distributions and graphing them, I'm
>>> using
>>> the infer library. I've been really impressed with this library and
>>> enjoy solving this type of problem with it.
>>> 
>>> One of the first steps in solving a hypothesis test with infer is
>>> to
>>> set up the initial sampling dataset. Often, in Lock5 problems, this
>>> is
>>> a dataset that can be downloaded with library(Lock5Data). However,
>>> other problems are worded like this:
>>> 
>>> ===========================
>>> In 1980 and again in 2010, a Gallup poll asked a random sample of
>>> 1000
>>> US citizens “Are you in favor of the death penalty for a person
>>> convicted of murder?” In 1980, the proportion saying yes was 0.66.
>>> In
>>> 2010, it was 0.64. Does this data provide evidence that the
>>> proportion
>>> of US citizens favoring the death penalty was higher in 1980 than
>>> it
>>> was in 2010? Use p1 for the proportion in 1980 and p2 for the
>>> proportion in 2010.
>>> ============================
>>> 
>>> I've been setting up problems like this with code similar to:
>>> ===========================
>>> df <- data.frame(
>>>      survey = c(rep("1980", 1000), rep("2010", 1000)),
>>>      DP = c(rep("Y", 0.66*1000), rep("N", 1000 - (0.66*1000)),
>>>             rep("Y", 0.64*1000), rep("N", 1000 - (0.64*1000))))
>>> 
>>> (d_hat <- df %>%
>>>      specify(response = DP, explanatory = survey, success = "Y") %>%
>>>      calculate(stat = "diff in props", order = c("1980", "2010")))
>>> 
>>> My question is, is this the way I should be setting up datasets for
>>> problems of this type? Is there a more efficient way, that doesn't
>>> require the construction of the whole sample dataset?
>>> 
>>> It seems like I should be able to do something like this:
>>> =================
>>> (df <- data.frame(group1count = 660, #Or, group1prop = 0.66
>>>                   group1samplesize = 1000,
>>>                   group2count = 640, #Or, group2prop = 0.64
>>>                   group2samplesize = 1000))
>>> =================
>>> 
>>> Am I overlooking a way to set up these sample dataframes for infer?
>>> 
>>> Thanks for your advice and guidance.
>>> 
>>> -Kevin
>>> 
>>> 
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> https://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> Hello,
>> 
>> base R is perfectly capable of solving the problem.
>> Something like this.
>> 
>> 
>> year <- c(1980, 2010)
>> p <- c(0.66, 0.64)
>> n <- c(1000, 1000)
>> df1 <- data.frame(year, p, n)
>> df1$yes <- with(df1, p*n)
>> df1$no <- with(df1, n - yes)
>> 
>> mat <- as.matrix(df1[c("yes", "no")])
>> 
>> prop.test(mat)
>> #>
>> #>  2-sample test for equality of proportions with continuity correction
>> #>
>> #> data:  mat
>> #> X-squared = 0.79341, df = 1, p-value = 0.3731
>> #> alternative hypothesis: two.sided
>> #> 95 percent confidence interval:
>> #>  -0.02279827  0.06279827
>> #> sample estimates:
>> #> prop 1 prop 2
>> #>   0.66   0.64
>> 
>> chisq.test(mat)
>> #>
>> #>  Pearson's Chi-squared test with Yates' continuity correction
>> #>
>> #> data:  mat
>> #> X-squared = 0.79341, df = 1, p-value = 0.3731
>> 
>> 
>> Hope this helps,
>> 
>> Rui Barradas
>> 
>> 
> 
> 
> 
