[R] Random selection from a subsample

David Winsemius dwinsemius at comcast.net
Sun Dec 19 13:47:41 CET 2010

On Dec 19, 2010, at 5:31 AM, Tom Wilding wrote:

> Dear Mailing List
> I have a data set (data4) consisting of a number of factors and a  
> response variable.  I wish to randomly sample from a combination of  
> two of those factors (GIS_station and Distance_code2) and return a  
> new dataframe containing the original data structure (i.e. all the  
> columns) but only containing the randomly selected rows.  The number  
> of rows in each combination of GIS_station and Distance_code2 vary  
> (widely) and some combinations are absent.
> This is getting there:
> with (data4,{
> sub_sample10=by(data4,list(GIS_station,Distance_code2), function(x)  
> {sample(1:nrow(x),10,replace=T)})
> })
> ....but just generates two random numbers from the range 1:nrow(x).

Only 2? Your argument to sample is 10.

> It doesn't return the selected rows, which is what I want.

And those row numbers would not refer to rows of the original data  
frame either; they would index rows within each by() subset. You have  
not yet done a very good job of specifying what sampling strategy is  
needed. At the moment you seem to be working toward a strategy that  
could be very uneven in the probability that members of different  
combinations get into the sample, since the number being chosen is  
fixed and the number to be chosen from "varies widely". Is that really  
what you want?
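
To put numbers on that unevenness, here is a small sketch. The group sizes are made up for illustration (they are not from your data), and the names group_sizes, n_fixed and p_fixed are just placeholders. When a fixed number of rows is drawn without replacement from each group, the chance that any particular row is picked is roughly n / group size:

```r
# Hypothetical group sizes, invented purely for illustration
group_sizes <- c(a = 5, b = 50, c = 500)

# Fixed number of rows drawn (without replacement) from every group
n_fixed <- 3

# Probability that any single row of a group ends up in the sample;
# pmin() caps the draw at the group size for very small groups
p_fixed <- pmin(n_fixed, group_sizes) / group_sizes

print(p_fixed)
# a = 0.6, b = 0.06, c = 0.006
```

So a row in the smallest group is 100 times more likely to be selected than a row in the largest one. Whether that matters depends on what the sample is for.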

> I'm sure this could be done in an elegant manner, using a  
> subscript e.g.
> sub_sample10 = data4 [sample (1:nrow (data4), size=10), ]

(You also have not provided a reproducible data example. Next time  
bring data.)

This works to sample 3 rows from each of the distinct categories in  
the warpbreaks data object:

by(warpbreaks, list(warpbreaks$wool, warpbreaks$tension),
   FUN = function(x) x[sample(1:nrow(x), 3), ])
# returns a list with 6 members, each of which is a three-row data frame

And this would stick them back together in one data frame:

  do.call(rbind, by(warpbreaks, list(warpbreaks$wool, warpbreaks$tension),
                    FUN = function(x) x[sample(1:nrow(x), 3), ]))
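
One caveat, since you say your group sizes vary widely and some combinations are absent: sample() without replacement stops with an error when a group has fewer rows than the number requested. The following is only a sketch of one way around that, not something tested on your data. The helper name sample_rows and the objects wb and picked are my own inventions, and I have thinned warpbreaks by hand to mimic an absent combination and a very small one:

```r
# Hypothetical helper (not from the original post): draw up to n rows,
# taking every row when the group has fewer than n
sample_rows <- function(x, n) {
  x[sample.int(nrow(x), min(n, nrow(x))), , drop = FALSE]
}

set.seed(1)  # arbitrary seed, only so the run is repeatable

# Mimic uneven data: drop wool A / tension L entirely (rows 1-9)
# and keep only two rows of wool A / tension M (rows 10-11)
wb <- warpbreaks[-c(1:9, 12:18), ]

# by() leaves NULL for the empty combination; rbind() skips NULLs
picked <- do.call(rbind,
                  by(wb, list(wb$wool, wb$tension), sample_rows, n = 3))

# Rows obtained per combination (at most 3 each)
table(picked$wool, picked$tension)
```

With your data the same call would presumably use list(data4$GIS_station, data4$Distance_code2) as the grouping. Taking all rows from small groups is a design choice; sampling with replace = TRUE, as in your original attempt, is the alternative if you need exactly n rows per group.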


> only somehow combining it with the 'by' statement (e.g. by (data4,  
> list (GIS_station, Distance_code2).......)) but I cannot get this to  
> work.
> Any guidance on this much appreciated.
> Thank you.

David Winsemius, MD
West Hartford, CT
