[Rd] proposed change to 'sample'
William Dunlap
wdunlap at tibco.com
Mon Jun 21 06:04:13 CEST 2010
> -----Original Message-----
> From: Peter Dalgaard [mailto:pdalgd at gmail.com]
> Sent: Sunday, June 20, 2010 2:12 PM
> To: William Dunlap
> Cc: Patrick Burns; r-devel at r-project.org
> Subject: Re: [Rd] proposed change to 'sample'
>
> William Dunlap wrote:
> >> -----Original Message-----
> >> From: r-devel-bounces at r-project.org
> >> [mailto:r-devel-bounces at r-project.org] On Behalf Of Patrick Burns
> ....
> >>
> >> I propose adding an argument that allows
> >> the user (programmer) to avoid that
> >> ambiguity:
> >>
> >> function (x, size, replace = FALSE, prob = NULL,
> >> max = length(x) == 1L && is.numeric(x) && x >= 1)
> >
> > S+'s sample() has an argument 'n' to achieve
> > the same result. It has been there since at
> > least 2005 (S+ 7.0.6). sample(n=n) means to
> > return a sample from seq_len(n), where n must
> > be a scalar nonnegative integer. sample(x=x)
> > retains its old ambiguous meaning.
> > sample(x, size = n, replace = F, prob = NULL, n = NULL, ...)
>
> Hmm, that doesn't really solve the issue, does it? I.e., you still
> cannot conveniently sample from a vector that is possibly of size 1.
>
> I would be more inclined to make sampling from a vector the normal
> case, and default x to say 1:max(n, size), forcing users to say
> sample(n=5) if sampling from x=1:5 is desired. This could be a
> manageable change; the deprecation sequence is a bit painful to think
> through, though.

I think the risk of breaking old code was why we
let the user write an unambiguous sample(n=n)
but didn't change how sample(x=scalar) worked.
Internally, we had long discouraged using sample(x=vector)
because of the ambiguity problem, preferring
x[sample(length(x),...)].
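
To make the ambiguity concrete, here is a minimal sketch in R of
the scalar case and of the index-based idiom mentioned above. The
helper name safe_sample is hypothetical and used only for
illustration; it is not part of R or S+.

```r
# sample() treats a length-one numeric x >= 1 as the upper bound of 1:x.
x <- c(10, 20, 30)
sample(x)              # a permutation of the three values 10, 20, 30

y <- 10
sample(y)              # a permutation of 1:10, not the single value 10

# Index-based idiom: sample positions, then subscript.
# 'safe_sample' is a hypothetical name for this sketch.
safe_sample <- function(x, ...) x[sample(length(x), ...)]

safe_sample(y)         # the single value 10
safe_sample(x)         # a permutation of 10, 20, 30
```

Because the positions are drawn from length(x), a vector that happens
to have length one is handled like any other vector.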

I notice that S+'s rsample() does not allow sampling
from a vector, only from seq_len(n). I think that
is because sampling rows from a data.frame (or the
bigdata equivalent, bdframe) was felt to be a more common
operation, and the code was simpler and faster if rsample()
didn't have to call out to possible subscripting methods.
Relaxing the requirement that the output be a randomly
permuted sample mattered more when dealing with long
datasets.

In any case, I was just stating that if sample were
changed to allow disambiguation of its first argument,
using 'n' instead of 'max' would be compatible with S+.
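
As a rough illustration of what an 'n' argument could look like in
R, here is a sketch of a wrapper. The name sample_n, the argument
checks, and the error raised when both 'x' and 'n' are supplied are
assumptions made for this example; they are not taken from the S+
source or from any proposed patch to R.

```r
# Hypothetical wrapper, not the S+ or R implementation.
# With 'n', sample unambiguously from seq_len(n).
# Without 'n', defer to sample(), keeping its old treatment of scalar x.
sample_n <- function(x, size, replace = FALSE, prob = NULL, n = NULL) {
  if (is.null(n)) {
    if (missing(size)) return(sample(x, replace = replace, prob = prob))
    return(sample(x, size, replace, prob))
  }
  # Assumption: supplying both 'x' and 'n' is treated as an error.
  if (!missing(x)) stop("supply either 'x' or 'n', not both")
  stopifnot(is.numeric(n), length(n) == 1L, n >= 0, n == round(n))
  if (missing(size)) size <- n
  sample.int(n, size, replace, prob)
}

sample_n(n = 5)             # a permutation of 1:5
sample_n(n = 10, size = 3)  # three distinct values from 1:10
sample_n(c(2, 4, 6))        # falls through to sample(x)
```

Old calls of the form sample(x) behave as before under this sketch;
only callers who name 'n' get the unambiguous behaviour.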

Bill Dunlap
Spotfire, TIBCO Software
wdunlap at tibco.com
>
> --
> Peter Dalgaard
> Center for Statistics, Copenhagen Business School
> Phone: (+45)38153501
> Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
>