[Rd] proposed change to 'sample'
William Dunlap
wdunlap at tibco.com
Sun Jun 20 19:49:43 CEST 2010
> -----Original Message-----
> From: r-devel-bounces at r-project.org
> [mailto:r-devel-bounces at r-project.org] On Behalf Of Patrick Burns
> Sent: Sunday, June 20, 2010 3:08 AM
> To: r-devel at r-project.org
> Subject: [Rd] proposed change to 'sample'
>
> There is a weakness in the 'sample'
> function that is highlighted in the
> help file. The 'x' argument can be
> either the vector from which to sample,
> or the maximum value of the sequence
> from which to sample.
>
> This can be ambiguous if the length of
> 'x' is one.
>
> I propose adding an argument that allows
> the user (programmer) to avoid that
> ambiguity:
>
> function (x, size, replace = FALSE, prob = NULL,
>     max = length(x) == 1L && is.numeric(x) && x >= 1)

S+'s sample() has an argument 'n' to achieve
the same result. It has been there since at
least 2005 (S+ 7.0.6). sample(n=n) means to
return a sample from seq_len(n), where n must
be a scalar nonnegative integer. sample(x=x)
retains its old ambiguous meaning.

    sample(x, size = n, replace = F, prob = NULL, n = NULL, ...)
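
To make the 'n' semantics concrete, here is a minimal R sketch. The name sample_n is hypothetical (it is not part of base R or S+) and the code only approximates the behaviour described above: when 'n' is supplied, the population is always seq_len(n); otherwise the call falls through to base sample() with its usual ambiguity.

```r
# Hypothetical wrapper, assumed for illustration only; not the S+ implementation.
sample_n <- function(x, size, replace = FALSE, prob = NULL, n = NULL) {
  if (!is.null(n)) {
    # 'n' must be a scalar nonnegative whole number
    stopifnot(is.numeric(n), length(n) == 1L, n >= 0, n == trunc(n))
    if (missing(size)) size <- n
    # Unambiguous: always sample from 1, 2, ..., n
    return(sample.int(n, size, replace = replace, prob = prob))
  }
  # No 'n' given: keep the old, ambiguous meaning of 'x'
  if (missing(size)) {
    sample(x, replace = replace, prob = prob)
  } else {
    sample(x, size, replace = replace, prob = prob)
  }
}

x <- c(3, 10, 12)
length(sample_n(x = x[x > 9]))   # 2: permutes the two values 10 and 12
length(sample_n(x = x[x > 11]))  # 12: the single value 12 is read as 1:12
length(sample_n(n = 12))         # 12: explicitly a sample from seq_len(12)
```

With 'n' the caller states the intent directly, so code that computes the population size at run time no longer depends on whether that size happens to be 1.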

S+ also has an rsample function where n (with
the same meaning) is the only way to specify the
population. It also has an order=TRUE/FALSE argument
where order=TRUE means to randomly order the output.
order=FALSE means that the ordering of the output is
unspecified, but it allows the person writing rsample
methods to use the quickest way to get a random sample
(for big data it can be fastest to return the sample
from 1:n in increasing order).

    rsample(n, size = n, replace = F, prob = NULL,
            bigdata = F, minimal = NULL, ..., order = T)

I like the idea of separating the concepts of sampling
and permuting data. Many statistics are invariant to
ordering of the data and it can be a waste of time
to randomly order a sample to feed to such functions.
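
One way to sample without permuting is selection sampling (Knuth's Algorithm S), which walks through 1:n once and returns the chosen indices in increasing order. The sketch below is only an illustration of that idea: ordered_sample is a hypothetical name, it is not how S+'s rsample is implemented, and the R loop is written for clarity rather than speed.

```r
# Hypothetical helper: draw 'size' indices from 1:n without replacement,
# returned in increasing order (selection sampling, no shuffle step).
ordered_sample <- function(n, size) {
  stopifnot(size >= 0, size <= n)
  out <- integer(size)
  chosen <- 0L
  for (i in seq_len(n)) {
    if (chosen == size) break
    # Keep i with probability (still needed) / (candidates still left)
    if (runif(1) * (n - i + 1) < size - chosen) {
      chosen <- chosen + 1L
      out[chosen] <- i
    }
  }
  out
}

set.seed(42)
x <- rnorm(1000)
idx <- ordered_sample(length(x), 50)
# An order-invariant statistic does not care that idx is sorted
all.equal(mean(x[idx]), mean(x[sample(idx)]))
```

A caller that does need a random ordering can still apply sample() to the result afterwards, which keeps the two steps separate.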

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> {
>     if (max) {
>         if (missing(size))
>             size <- x
>         .Internal(sample(x, size, replace, prob))
>     }
>     else {
>         if (missing(size))
>             size <- length(x)
>         x[.Internal(sample(length(x), size, replace, prob))]
>     }
> }
> <environment: namespace:base>
>
>
> This just takes the condition of the first
> 'if' to be the default value of the new 'max'
> argument.
>
> So in the "surprise" section of the examples
> in the 'sample' help file
>
> sample(x[x > 9])
>
> and
>
> sample(x[x > 9], max=FALSE)
>
> have different behaviours.
>
> By the way, I'm certainly not convinced that
> 'max' is the best name for the argument.
>
> --
> Patrick Burns
> pburns at pburns.seanet.com
> http://www.burns-stat.com
> (home of 'Some hints for the R beginner'
> and 'The R Inferno')
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>