[R] sampling from data frame

Bill.Venables@cmis.csiro.au Bill.Venables at cmis.csiro.au
Thu Jun 6 09:57:39 CEST 2002


Maria Wolters asks:

>  -----Original Message-----
> From: 	Maria Wolters [mailto:maria at rhetorical.com] 
> Sent:	Thursday, June 06, 2002 5:24 PM
> To:	R-help Digest
> Cc:	r-help-digest at stat.math.ethz.ch
> Subject:	[R] sampling from data frame 
> 
> 
> Hello,
> 
> after searching through the archives and
> not finding a thread that answers this question,
> I thought I'd pass it on to the list.
> 
> Given a data frame and given a factor variable
> that assigns a class to each case in the data frame,
> what is the most efficient way to sample
> a given number of cases from each class?
	[WNV]  Not clear what you mean.  Let me take a stab.  Suppose the
data frame is Dat and the factor is G.  Furthermore suppose the classes are
G1, G2, ..., Gm and the vector k tells you how many you want from each
class, k[1] from G1, ... , k[m] from Gm.

	Here is a way of sampling without replacement in this way, but I
would not say it is necessarily the most efficient:

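		bits <- split(1:nrow(Dat), Dat$G)  # row indices for each class
		m <- length(bits)                  # number of classes
		# pick k[x] of the indices in class x; note that a class with a
		# single row needs care, since sample(6, 1) draws from 1:6
		wh <- unlist(sapply(1:m, function(x) sample(bits[[x]], k[x]),
		                    simplify = FALSE))
		DatSample <- Dat[wh, ]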

	This gives you the sample as a data frame consisting of the selected
rows of Dat.  I'm not all that convinced that you need to do the picking
using sapply, in fact.  Personally I'd probably just use a loop that I
didn't have to think about too hard:

		wh <- list( )
		for(j in seq_along(bits)) wh[[j]] <- sample(bits[[j]], k[j])
		wh <- unlist(wh)
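
	For anyone who wants to try the loop end to end, here is a minimal
sketch on made-up data.  The names Dat, G and k follow the text above; the
values in the toy data frame and the helper pick() are assumptions for
illustration only.  The helper is there because sample(x, n) treats a single
number x as 1:x, which matters when a class has only one row.

```r
# Toy data: three classes with 5, 1 and 4 rows (values are made up).
set.seed(1)
Dat <- data.frame(G = factor(rep(c("a", "b", "c"), times = c(5, 1, 4))),
                  y = 1:10)
k <- c(2, 1, 3)   # how many rows to take from each class, in level order

# Hypothetical helper: sample positions rather than values, so that a
# length-one index vector such as 6 is not expanded to 1:6 by sample().
pick <- function(x, n) x[sample(length(x), n)]

bits <- split(1:nrow(Dat), Dat$G)   # row indices for each class
wh <- list()
for (j in seq_along(bits)) wh[[j]] <- pick(bits[[j]], k[j])
wh <- unlist(wh)

DatSample <- Dat[wh, ]              # the sampled rows
table(DatSample$G)                  # counts per class should match k
```

	Sampling is without replacement within each class, so every k[j]
must be no larger than the number of rows in class j.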

	If you want a constant number from each class, this is a bit
simpler.
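
	One way to read "a bit simpler" is sketched below, assuming the
same number n is wanted from every class.  The toy data and the name n are
assumptions for illustration, not part of the original question.

```r
# Toy data: three classes with 4 rows each (values are made up).
set.seed(2)
Dat <- data.frame(G = factor(rep(c("a", "b", "c"), each = 4)), y = 1:12)
n <- 2   # constant number of rows wanted from every class

bits <- split(1:nrow(Dat), Dat$G)   # row indices for each class
# Sample positions within each index vector, so a class with a single
# row is handled correctly as well.
wh <- unlist(lapply(bits, function(idx) idx[sample(length(idx), n)]),
             use.names = FALSE)

DatSample <- Dat[wh, ]
table(DatSample$G)                  # each class should appear n times
```

	With a constant n there is no vector k to keep aligned with the
factor levels, so the function can be applied to the list bits directly.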


> I've found a roundabout solution that works as follows:
> for each class:
>     assign unique index to each class member
>     chosen_cases <-  sample(n,indexvariable)
>     extract chosen_cases from data frame
>     (i.e. chosen <- subset(data, indexvariable %in% chosen_cases))
	[WNV]  I know this is meta-code, but using _ in names can be a bit
ambiguous, since R also accepts _ as an assignment operator...

> this solution relies on the Hmisc library and is
> horribly inefficient. Any ideas on how to make it better
> would be greatly appreciated.
> 
> Best from Edinburgh,
	[WNV]  Same from Brisbane, where I suspect the temperature is
getting close to that in Edinburgh right now.  In July you might just have
that edge, though... :-)

	Bill Venables	

> Maria
> 
> -- 
> Maria Wolters		maria.wolters
> Development Engineer    AT
> Rhetorical Systems Ltd. rhetorical.com
> 		   Edinburgh
> 
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
> -.-.-
> r-help mailing list -- Read
> http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
> _._._
