[R] Setting up hypothesis tests with the infer library?
David Winsemius
dwinsemius at comcast.net
Sun Mar 30 05:08:10 CEST 2025
> On Mar 29, 2025, at 9:59 AM, Kevin Zembower via R-help <r-help using r-project.org> wrote:
>
> Hi, Rui and Michael, thank you both for replying.
>
> Yeah, I'm not supposed to know about Chi-squared yet. So far, all of
> our work with hypothesis tests has involved creating the sample data,
> then resampling it to create a null distribution, and finally computing
> p-values.
You might want to look at the "resample" package. It's been around for a while. One of its authors is Tim Hesterberg, a well-respected member of the R community. I think it was written as support for Chihara and Hesterberg, "Mathematical Statistics with Resampling and R", 3rd Edition (2022).
--
David.
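The resampling workflow Kevin describes (resample the data to build a null distribution, then compute a p-value) can be sketched in base R for the Gallup example quoted below. This is illustrative only, not the course's exact method; the counts 660/1000 and 640/1000 are derived from the stated proportions 0.66 and 0.64.

```r
# Permutation-style null distribution for the difference in proportions.
set.seed(1)
yes  <- c(rep(1, 660), rep(0, 340), rep(1, 640), rep(0, 360))
year <- rep(c("1980", "2010"), each = 1000)
obs_diff <- mean(yes[year == "1980"]) - mean(yes[year == "2010"])

null_diffs <- replicate(5000, {
  shuffled <- sample(yes)  # shuffling breaks any year/response link
  mean(shuffled[year == "1980"]) - mean(shuffled[year == "2010"])
})

# One-sided p-value: proportion of shuffled differences at least as
# large as the observed difference of 0.02.
p_value <- mean(null_diffs >= obs_diff)
```

The p-value lands in the same neighborhood as the normal-theory tests discussed further down the thread.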
>
> prop.test() would work, obviously. I'll look into that; I didn't
> know about it.
>
> I'm really struck by the changes in statistics teaching methods,
> compared to my first exposure to statistics more than 20 years ago. I
> can't ever remember doing a simulation then, probably due to the lack
> of computing resources. It was all calculations based on the Normal
> curve. Now, we haven't even been introduced to any calculations
> involving the Normal curve, and won't be for another two chapters. It's
> the last chapter we study in this one-semester course. In this course,
> it's all been simulations, bootstraps, randomized distributions, etc.
>
> Thank you again, Rui and Michael, for your help.
>
> -Kevin
>
> On Sat, 2025-03-29 at 16:42 +0000, Rui Barradas wrote:
>> Às 16:09 de 29/03/2025, Kevin Zembower via R-help escreveu:
>>> Hello, all,
>>>
>>> We're now starting to cover hypothesis tests in my Stats 101 course.
>>> As usual in courses using the Lock5 textbook, 3rd ed., the homework
>>> answers are calculated using their StatKey application. In addition
>>> (and for no extra credit), I'm trying to solve the problems using R.
>>> In the case of hypothesis tests, in addition to manually setting up
>>> randomized null hypothesis distributions and graphing them, I'm using
>>> the infer library. I've been really impressed with this library and
>>> enjoy solving this type of problem with it.
>>>
>>> One of the first steps in solving a hypothesis test with infer is to
>>> set up the initial sampling dataset. Often, in Lock5 problems, this is
>>> a dataset that can be downloaded with library(Lock5Data). However,
>>> other problems are worded like this:
>>>
>>> ===========================
>>> In 1980 and again in 2010, a Gallup poll asked a random sample of 1000
>>> US citizens “Are you in favor of the death penalty for a person
>>> convicted of murder?” In 1980, the proportion saying yes was 0.66. In
>>> 2010, it was 0.64. Does this data provide evidence that the proportion
>>> of US citizens favoring the death penalty was higher in 1980 than it
>>> was in 2010? Use p1 for the proportion in 1980 and p2 for the
>>> proportion in 2010.
>>> ============================
>>>
>>> I've been setting up problems like this with code similar to:
>>> ===========================
>>> df <- data.frame(
>>>   survey = c(rep("1980", 1000), rep("2010", 1000)),
>>>   DP = c(rep("Y", 0.66*1000), rep("N", 1000 - (0.66*1000)),
>>>          rep("Y", 0.64*1000), rep("N", 1000 - (0.64*1000))))
>>>
>>> (d_hat <- df %>%
>>>   specify(response = DP, explanatory = survey, success = "Y") %>%
>>>   calculate(stat = "diff in props", order = c("1980", "2010")))
>>> ============================
>>>
>>> My question is, is this the way I should be setting up datasets for
>>> problems of this type? Is there a more efficient way, that doesn't
>>> require the construction of the whole sample dataset?
>>>
>>> It seems like I should be able to do something like this:
>>> =================
>>> (df <- data.frame(group1count = 660,      # or, group1prop = 0.66
>>>                   group1samplesize = 1000,
>>>                   group2count = 640,      # or, group2prop = 0.64
>>>                   group2samplesize = 1000))
>>> =================
>>>
>>> Am I overlooking a way to set up these sample dataframes for infer?
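As far as I know, infer's specify() expects case-level data rather than summary counts, so some expansion step is unavoidable. One way to make it less tedious is a small helper that builds the long data frame from counts; the function name and arguments here are my own invention, not part of infer:

```r
# Expand summary counts into the individual-level data frame that
# infer's specify() expects. 'yes' gives the success count per group.
counts_to_df <- function(groups, yes, n, response = c("Y", "N")) {
  data.frame(
    survey = rep(groups, times = n),
    DP = unlist(Map(function(y, total) {
      c(rep(response[1], y), rep(response[2], total - y))
    }, yes, n))
  )
}

df <- counts_to_df(c("1980", "2010"), yes = c(660, 640), n = c(1000, 1000))
```

The resulting df can then be piped into specify()/calculate() exactly as in the code above.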
>>>
>>> Thanks for your advice and guidance.
>>>
>>> -Kevin
>>>
>>>
>>>
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> https://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> Hello,
>>
>> base R is perfectly capable of solving the problem.
>> Something like this.
>>
>>
>> year <- c(1980, 2010)
>> p <- c(0.66, 0.64)
>> n <- c(1000, 1000)
>> df1 <- data.frame(year, p, n)
>> df1$yes <- with(df1, p*n)
>> df1$no <- with(df1, n - yes)
>>
>> mat <- as.matrix(df1[c("yes", "no")])
>>
>> prop.test(mat)
>> #>
>> #> 2-sample test for equality of proportions with continuity correction
>> #>
>> #> data: mat
>> #> X-squared = 0.79341, df = 1, p-value = 0.3731
>> #> alternative hypothesis: two.sided
>> #> 95 percent confidence interval:
>> #> -0.02279827 0.06279827
>> #> sample estimates:
>> #> prop 1 prop 2
>> #> 0.66 0.64
>>
>> chisq.test(mat)
>> #>
>> #> Pearson's Chi-squared test with Yates' continuity correction
>> #>
>> #> data: mat
>> #> X-squared = 0.79341, df = 1, p-value = 0.3731
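As a hand check of the output above: with Yates' correction, the chi-squared statistic is the square of a two-proportion z statistic computed with a continuity correction of (1/n1 + 1/n2)/2 subtracted from the absolute difference. A short verification:

```r
# Reproduce X-squared = 0.79341 by hand from the pooled z statistic.
p1 <- 0.66; p2 <- 0.64; n1 <- 1000; n2 <- 1000
p_pool <- (p1 * n1 + p2 * n2) / (n1 + n2)            # pooled proportion, 0.65
se <- sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))    # pooled standard error
cc <- (1/n1 + 1/n2) / 2                              # Yates continuity correction
z  <- (abs(p1 - p2) - cc) / se
x_squared <- z^2           # ~0.7934, matching the output above
p_val <- 2 * pnorm(-z)     # ~0.3731, matching the two-sided p-value
```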
>>
>>
>> Hope this helps,
>>
>> Rui Barradas
>>
>>
>
>
>