[R] What don't I understand about sample()?
Rui Barradas
ruipbarradas at sapo.pt
Sun Mar 16 10:05:29 CET 2025
Hello,
Inline.
At 01:31 on 16/03/2025, avi.e.gross using gmail.com wrote:
> Rui,
>
> I just want to discuss this point you made for now:
>
> 2.
> Systematic use of apply(., 1, mean).
> rowMeans (and colMeans) are much faster.
>
> You are, of course, technically correct. I am considering another aspect, in which you are teaching a programming language as a small set of "commands" that can do quite a bit in various legal combinations. These are often reserved words and primitives like "if", or operators like "<-", and so on.
>
> You can then discuss various built-in functionality that extends your abilities, such as the array of built-in functions in common use. I see these in several categories. Some can be examined at the R prompt to see the R code that implements them. You can learn how such a function works and perhaps cannibalize it, keeping the helpful parts and dropping functionality you do not need.
>
> The other kind is relatively hidden as .Internal, or as a function that after some processing calls .Internal(...), and can be written in various (usually compiled) languages such as C, or perhaps pulled from libraries such as FORTRAN mathematical routines. These are sometimes faster partly because they are not being read and interpreted, and partly because they can use clever programming tricks.
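>
> For example, typing a function's name at the prompt shows which kind
> you have (just an illustration):
>
> sd            # plain R source you can read and borrow from
> mean.default  # R wrapper that ends by calling .Internal(mean(x))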
>
> So, as I see it, whatever teacher or book or course the OP is using, they are supposedly learning ways to approach problems at the level of ideas. The goal is not just learning how to do one problem, but adding to your tool bag for additional problems. And, in R, some of those ideas revolve around how to do implicit loops and, in some sense, vectorized operations.
>
> So, if I understood it, the assignment involved taking some number of equal-sized vectors and, instead of working on them one at a time, assembling them into a collection. Besides generalized lists, R supports data frames as lists of vectors (or, actually, other things too), matrices implemented as vectors with attributes, and some other structures.
>
> The assignment seems to ask them to assemble the vectors into the rows of a matrix object and then use something that operates on one row at a time, as already discussed. Clearly, someone has already provided implementations that let you avoid calling apply() directly and get the same result faster, using rowMeans. But note that although a few other row-wise operations are available this way, such as rowSums, in the general case there are many possible functions not already supplied.
>
> If I want something like a rowMinMax, or something exotic where I treat each row as the coefficients of a polynomial to evaluate at some given value of x, then a less efficient solution can easily be put together with something like apply(); or, for other data structures, with cousins like lapply/sapply/vapply and quite a few other methods in packages, including functions like pmap and interesting tools in dplyr such as rowwise() that make per-row calculations easy.
I couldn't agree more. My comment applies to the use of apply(., 1,
mean), not to the use of apply() in general. In the text, mean() is the
only function being apply'ed; that's why I made the comment.
I have read several times that when R users learn how to use
apply/lapply effectively, they have made it through R's [steep]
learning curve. I know I felt that way when I learned those functions.
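
For instance (an illustrative sketch, with made-up data):

m <- matrix(rnorm(12), nrow = 3)
rowMeans(m)        # fast, special-cased in compiled code
apply(m, 1, mean)  # same result via the general mechanism
# there is no base rowMinMax(), but apply() handles the general case:
apply(m, 1, function(r) c(min = min(r), max = max(r)))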
>
> There are usually many ways to do something in R, as in many languages, and sometimes a solution that is highly efficient in some situations is not the best choice in others. An example might be how some forms of looping are better suited to cases where the number of items is not known in advance. A compiled language will often optimize away a loop over a fixed small number of items, say an index going from 1 to 2 to calculate the squares of the index and store them in an array of two. The loop may be replaced by two direct function calls, or the calculation may even be done at compile time, inserting a data structure with the results so that nothing major happens at run time.
>
> In this case, the OP may actually have a small fixed number of such results and possibly there are faster ways to do the calculation. But, if the point is to learn how to do arbitrary calculations when the magnitude of the parts may vary, then this is reasonable. Of course, if the samples being used are not all the same length, this method can fail badly while other methods may still work well.
>
> I note, relatedly, that one of us here has expressed a wish to keep some of their code as near to base R as possible rather than jump right away into the tidyverse.
Just to use read_csv() when base R has read.csv(). The OP is reading
data from the internet, so read.csv() is all that's needed.
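
Something like this is enough (the URL below is a placeholder, not the
real location of the file):

url <- "https://example.com/CommuteAtlanta.csv"
CommuteAtlanta <- read.csv(url)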
And, related, it is not necessary to load a package to have access to
its data sets. For instance,
data(CommuteAtlanta, package = "Lock5Data")
will load the data used for the example of section 1.1.
> I can respect that even as I, personally, prefer using whatever works
> faster and easier for me. At least, that is true when I am prototyping
> or doing something one-off. If the code turns out to be used a lot and
> is slow, I might reconsider. But, having said that, I suspect some
> packages may allow faster code than the basics. Of course, things like
> the built-in native pipe |> can change things, so an altered package
> using a slower pipe may no longer ...
Thanks for the notes, they are always welcome.
>
> -----Original Message-----
> From: R-help <r-help-bounces using r-project.org> On Behalf Of Rui Barradas
> Sent: Saturday, March 15, 2025 5:28 PM
> To: Kevin Zembower <kevin using zembower.org>; r-help using r-project.org
> Subject: Re: [R] What don't I understand about sample()?
>
> Hello,
>
> I have been following this thread and though answers have been given,
> some of them address R coding in general, not necessarily the sample()
> function. Here are some random notes I think the OP could use, prompted
> by the text linked to, chap3.pdf.
>
> 1.
> Throughout the text, assignments use the equal sign instead of the left
> arrow. The left arrow is generally considered more idiomatic and there
> is an important difference between the two; see help("assignOps").
>
>
> time.mean = with(CommuteAtlanta, mean(Time))
> B = 1000
> n = nrow(CommuteAtlanta)
>
> # This should be used, not the above.
> time.mean <- with(CommuteAtlanta, mean(Time))
> B <- 1000
> n <- nrow(CommuteAtlanta)
>
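> One visible consequence, as a quick illustration (this example is not
> in the text):
>
> mean(y = 1:10)   # error: mean() has no argument 'y'; 'x' is missing
> mean(y <- 1:10)  # computes the mean and creates y in the workspace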
>
> 2.
> Systematic use of apply(., 1, mean).
> rowMeans (and colMeans) are much faster.
>
>
> boot.statistics <- apply(boot.samples, 1, mean)
> boot.statistics <- rowMeans(boot.samples)
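>
> A quick timing sketch (illustrative; the exact numbers vary by machine):
>
> m <- matrix(rnorm(1e6), 1000, 1000)
> system.time(apply(m, 1, mean))  # interpreted loop over rows
> system.time(rowMeans(m))        # compiled, much faster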
>
>
> 3.
> The first confidence interval computation seems awkward. I had never
> seen this way of computing a CI95.
> Moreover, it's plain common sense to keep the results at full precision
> and round for display purposes only.
> And the normal intervals are computed in a more usual way later in the
> text; see sections 1.2 and 1.3.
>
>
> me <- ceiling(10 * 2 * time.se)/10
> round(time.mean, 1) + c(-1, 1) * me
>
> # Straightforward.
> normal_ci95 <- time.mean + c(-1, 1) * 2 * time.se
> normal_ci95
> round(normal_ci95, 1)
>
> # section 1.2, function boot.mean
> interval = mean(x) + c(-1,1)*2*se
>
> # section 1.3
> with(students, mean(Height) + c(-1, 1) * 2 * sd(result))
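>
> For a bit more precision, qnorm() can replace the rule-of-thumb 2 in
> all of these (an illustrative variant, not in the text):
>
> normal_ci95 <- time.mean + c(-1, 1) * qnorm(0.975) * time.se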
>
>
>
> 4.
> In section 1.2 there is a bootstrap function boot.mean().
> The function could be improved to let users pass a conf.level of their
> choice.
> And why force the function user to always have the plot displayed? Base
> functions hist() and barplot() have na argument 'plot' with default TRUE
> allowing the user to choose.
>
> The following seems more user friendly.
>
>
> boot.mean <- function(x, B, binwidth = NULL, conf.level = 0.95,
>                       plot = TRUE) {
>   require(ggplot2)
>   n <- length(x)
>   boot.samples <- matrix(sample(x, size = n * B, replace = TRUE), B, n)
>   boot.statistics <- rowMeans(boot.samples)
>   se <- sd(boot.statistics)
>   if (is.null(binwidth))
>     binwidth <- diff(range(boot.statistics))/30
>
>   p <- ggplot(data.frame(x = boot.statistics), aes(x = x)) +
>     geom_histogram(aes(y = ..density..), binwidth = binwidth) +
>     geom_density(color = "red")
>
>   alpha <- 1 - (1 - conf.level)/2
>   interval <- mean(x) + c(-1, 1) * qnorm(alpha) * se
>
>   if (plot) {
>     plot(p)
>   }
>
>   list(boot.statistics = boot.statistics, interval = interval,
>        se = se, plot = p)
> }
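>
> An illustrative call, with made-up data, just to show the new
> arguments:
>
> set.seed(1)
> x <- rnorm(50, mean = 10)
> res <- boot.mean(x, B = 1000, conf.level = 0.90, plot = FALSE)
> res$interval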
>
>
> Hope this helps,
>
> Rui Barradas
>
>
>
> At 17:28 on 15/03/2025, Kevin Zembower via R-help wrote:
>> Hi, Richard, thanks for replying. I should have mentioned the third
>> edition, which we're using. The data file didn't change between the
>> second and third editions, and the data on Body Mass Gain was the same
>> as in the first edition, although the first edition data file contained
>> additional variables.
>>
>> According to my text, the BMGain was measured in grams. Thanks for
>> pointing out that my statement of the problem lacked crucial
>> information.
>>
>> The matrix in my example comes from an example in
>> https://pages.stat.wisc.edu/~larget/stat302/chap3.pdf, where the author
>> created a bootstrap example with a matrix that consisted of one row for
>> every sample in the bootstrap, and one column for each observation in
>> the original data. This allowed him to find the mean of each row to
>> create the bootstrap statistics.
>>
>> The only need for the tidyverse is the read_csv() function. I'm
>> regrettably lazy in not determining which of the several packages in
>> the tidyverse collection provides read_csv(), and I just load the
>> whole thing.
>>
>> Thanks, again, for helping me to further understand R and this problem.
>>
>> -Kevin
>>
>> On Sat, 2025-03-15 at 12:00 +0100, r-help-request using r-project.org wrote:
>>> Not having the book (and which of the three editions are you using?),
>>> I downloaded the data and played with it for a bit.
>>> dotchart() showed the Dark and Light conditions looked quite
>>> different, but also showed that there are not very many cases.
>>> After trying t.test, it occurred to me that I did not know whether
>>> "BMGain" means gain in *grams* or gain in *percent*.
>>> Reflection told me that for a growth experiment, percent made more
>>> sense, which reminded me of one of my first
>>> student advising experiences, where I said "never give the computer
>>> percentages; let IT calculate the percentages
>>> from the baseline and outcome, because once you've thrown away
>>> information, the computer can't magically get it back."
>>> In particular, in the real world I'd be worried about the possibility
>>> that there was some confounding going on, so I would
>>> much rather have initial weight and final weight as variables.
>>> If BMGain is an absolute measure, the p value for a t test is teeny
>>> tiny.
>>> If BMGain is a percentage, the p value for a sensible t test is about
>>> 0.03.
>>>
>>> A permutation test went like this.
>>> is.light <- d$Group == "Light"
>>> is.dark <- d$Group == "Dark"
>>> score <- function (g) mean(g[is.light]) - mean(g[is.dark])
>>> base.score <- score(d$BMGain)
>>> perm.scores <- sapply(1:997, function (i) score(sample(d$BMGain)))
>>> sum(perm.scores >= base.score) / length(perm.scores)
>>>
>>> I don't actually see where matrix() comes into it, still less
>>> anything
>>> in the tidyverse.
>>>
>>
>
>