[Rd] proposed change to 'sample'

Mon Jun 21 06:04:13 CEST 2010

> -----Original Message-----
> From: Peter Dalgaard [mailto:pdalgd at gmail.com] 
> Sent: Sunday, June 20, 2010 2:12 PM
> To: William Dunlap
> Cc: Patrick Burns; r-devel at r-project.org
> Subject: Re: [Rd] proposed change to 'sample'
> 
> William Dunlap wrote:
> >> -----Original Message-----
> >> From: r-devel-bounces at r-project.org 
> >> [mailto:r-devel-bounces at r-project.org] On Behalf Of Patrick Burns
> ....
> >>
> >> I propose adding an argument that allows
> >> the user (programmer) to avoid that
> >> ambiguity:
> >>
> >> function (x, size, replace = FALSE, prob = NULL,
> >>      max = length(x) == 1L && is.numeric(x) && x >= 1)
> > 
> > S+'s sample() has an argument 'n' to achieve
> > the same result.  It has been there since at
> > least 2005 (S+ 7.0.6).  sample(n=n) means to
> > return a sample from seq_len(n), where n must
> > be a scalar nonnegative integer.  sample(x=x)
> > retains its old ambiguous meaning.
> >   sample(x, size = n, replace = F, prob = NULL, n = NULL, ...)
> 
> Hmm, that doesn't really solve the issue, does it? I.e., you still cannot
> conveniently sample from a vector that is possibly of size 1.
> 
> I would be more inclined to make sampling from a vector the normal case,
> and default x to, say, 1:max(n, size), forcing users to say sample(n=5) if
> sampling from x=1:5 is desired. This could be a manageable change; the
> deprecation sequence is a bit painful to think through, though.

I think that the risk of breaking old code was why we
allowed the user to use an unambiguous sample(n=n),
but didn't change how sample(x=scalar) worked.
Internally, we had long discouraged using sample(x=vector)
because of the ambiguity problem, preferring
x[sample(length(x),...)].
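
To make the ambiguity concrete, here is a minimal sketch in R.  The
helper name sample_from is hypothetical (it is not part of base R or
S+); it only wraps the indexing idiom mentioned above, using
sample.int() so that the length is never mistaken for a population.

```r
# Hypothetical helper: always treats x as the population to sample from,
# even when x happens to be a single number.
sample_from <- function(x, size = length(x), replace = FALSE, prob = NULL) {
  x[sample.int(length(x), size = size, replace = replace, prob = prob)]
}

x <- 10                     # a "vector" that happens to have length 1
set.seed(1)
ambiguous <- sample(x)      # sample() treats this as 1:10
safe <- sample_from(x)      # draws from the one-element vector c(10)

length(ambiguous)           # 10 values, a permutation of 1:10
safe                        # the single element of x
```

The point of the idiom is that the caller never passes the data itself
as the first argument of sample(), so its length cannot change the
meaning of the call.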

I notice that S+'s rsample() does not allow sampling
from a vector, only from seq_len(n).  I think that
is because it was felt that sampling rows from a data.frame
(or the bigdata equivalent, bdframe) was a more common
operation and the code was simpler/faster if rsample didn't
have to call out to possible subscripting methods.  Relaxing
the requirement that the output be a randomly permuted
sample became more important when dealing with long
datasets.

In any case, I was just stating that if sample were
changed to allow disambiguation of its first argument,
using 'n' instead of 'max' would be compatible with S+.
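
As a rough illustration of what an 'n' argument could look like, here
is a sketch of a wrapper around the existing R functions.  The name
sample_n and its checks are assumptions made for illustration; this is
not the S+ implementation and not a proposed patch.

```r
# Hypothetical wrapper: 'n' gives an unambiguous "sample from seq_len(n)",
# while calls that supply 'x' keep the existing behaviour of sample().
sample_n <- function(x, size, replace = FALSE, prob = NULL, n = NULL) {
  if (!is.null(n)) {
    if (!missing(x)) stop("supply either 'x' or 'n', not both")
    # n must be a single nonnegative whole number
    if (!is.numeric(n) || length(n) != 1L || is.na(n) ||
        n < 0 || n != floor(n)) {
      stop("'n' must be a scalar nonnegative integer")
    }
    if (missing(size)) size <- n
    return(sample.int(n, size = size, replace = replace, prob = prob))
  }
  # Old, ambiguous meaning is retained when only 'x' is given.
  if (missing(size)) {
    sample(x, replace = replace, prob = prob)
  } else {
    sample(x, size = size, replace = replace, prob = prob)
  }
}

set.seed(42)
by_n <- sample_n(n = 5)     # a permutation of 1:5
by_x <- sample_n(7)         # still means 1:7, as with sample(7)
```

Under this sketch, old code that calls with x keeps working unchanged,
and only callers who opt in to n get the unambiguous form.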

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> 
> -- 
> Peter Dalgaard
> Center for Statistics, Copenhagen Business School
> Phone: (+45)38153501
> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>