[R] sampling dataframe based upon number of record occurrences

Curtis Burkhalter curtisburkhalter at gmail.com
Wed Mar 4 19:56:55 CET 2015


That worked great, thanks so much David!

On Wed, Mar 4, 2015 at 8:23 AM, David L Carlson <dcarlson at tamu.edu> wrote:

> I'm not sure I understand, but I think you have a large data frame with
> records and you want to construct a sample of that data frame that includes
> no more than 3 records for each IDbyYear combination? You say there are
> 5589 unique combinations and your code uses a data frame called
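> fitting_set. Assuming this is the data frame you are describing, your code
> cannot work as intended: fitting_set$IDbyYear[i] is always a vector of
> length 1, so length(fitting_set$IDbyYear[i] > 3) is always 1 and the test
> never counts how many records share an IDbyYear value.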
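
A minimal sketch of the point above, using a hypothetical toy vector `ids` in place of fitting_set$IDbyYear (the values are made up for illustration). It shows that the length of a single indexed element is 1 however often the value occurs, and that something like table() is needed to count occurrences:

```r
# Hypothetical toy values standing in for fitting_set$IDbyYear
ids <- c(42.24, 42.24, 42.24, 42.24, 45.32, 45.32)

# Indexing one element and comparing it gives a single logical value,
# so its length is 1 no matter how often that ID occurs.
len_one <- length(ids[1] > 3)

# Counting occurrences per ID needs table() (or similar) instead.
counts <- table(ids)
over_three <- names(counts)[counts > 3]
```

Here `over_three` holds the IDs that occur more than 3 times, which is the condition the loop was meant to test.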
>
> We need a reproducible example. The best way for you to give us that would
> be to copy the result of dput(head(fitting_set, 10)). It would look
> something like this plus the 6 other columns you mention except that I've
> added dta <- in front of structure() to create a data frame:
>
> dta <- structure(list(IDbyYear = c(42.24, 42.24, 42.24, 42.24, 42.24,
> 42.24, 45.32, 45.32, 45.36, 45.4, 45.4), SiteID = structure(c(1L,
> 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("A-Airport",
> "A-Bark Corral East"), class = "factor"), Year = c(2006L, 2006L,
> 2006L, 2006L, 2006L, 2006L, 2008L, 2008L, 2009L, 2010L, 2010L
> )), .Names = c("IDbyYear", "SiteID", "Year"), class = "data.frame",
> row.names = c(NA, -11L))
>
> Now create a list of data frames, one for each IDbyYear:
>
> dta.list <- split(dta, dta$IDbyYear)
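
To see what split() returns here, a small sketch on a hypothetical toy data frame (`toy` and its values are assumptions for illustration, not the real data). Each unique IDbyYear value becomes one named list element holding that group's rows:

```r
# Toy data frame with hypothetical values (not the real fitting_set)
toy <- data.frame(IDbyYear = c(42.24, 42.24, 45.32),
                  Year = c(2006L, 2006L, 2008L))

toy.list <- split(toy, toy$IDbyYear)

# One list element per unique IDbyYear, named by that value
group_names <- names(toy.list)

# Number of rows that ended up in each group
group_rows <- sapply(toy.list, nrow)
```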
>
> Now a function that will select 3 rows or all of them if there are fewer:
>
> smp <- function(dframe) {
>         ind <- seq_len(nrow(dframe))
>         # keep 3 randomly chosen rows, or every row when there are fewer than 3
>         dframe[sample(ind, min(3, length(ind))), ]
> }
>
> Now take the samples and combine them into a single data frame:
>
> dta.sample <- do.call(rbind, lapply(dta.list, smp))
> dta.sample
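
For the archive, a hedged variant of the split/lapply approach above that samples row indices rather than data frames, so the result keeps the original row order and row names. This is only a sketch: the data frame is rebuilt by hand from the IDbyYear and Year values in the dput() output (SiteID is left out for brevity), and the names `keep`, `capped`, and `group_sizes` as well as the seed are my own choices:

```r
# Rebuild a small version of the example data (SiteID omitted for brevity)
dta <- data.frame(
  IDbyYear = c(rep(42.24, 6), 45.32, 45.32, 45.36, 45.4, 45.4),
  Year = c(rep(2006L, 6), 2008L, 2008L, 2009L, 2010L, 2010L)
)

set.seed(1)  # arbitrary seed, only so the draw can be repeated

# Sample up to 3 row indices within each IDbyYear group
keep <- unlist(
  lapply(split(seq_len(nrow(dta)), dta$IDbyYear), function(ind) {
    ind[sample.int(length(ind), min(3, length(ind)))]
  }),
  use.names = FALSE
)

# Subset once, in the original row order
capped <- dta[sort(keep), ]

# Check: every IDbyYear group should now have at most 3 rows
group_sizes <- table(capped$IDbyYear)
```

Indexing with `ind[sample.int(...)]` avoids relying on how sample() treats a single number as its first argument, and `group_sizes` gives a quick check that no group exceeds 3 rows.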
>
> -------------------------------------
> David L Carlson
> Department of Anthropology
> Texas A&M University
> College Station, TX 77840-4352
>
>
> -----Original Message-----
> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Curtis
> Burkhalter
> Sent: Tuesday, March 3, 2015 3:23 PM
> To: r-help at r-project.org
> Subject: [R] sampling dataframe based upon number of record occurrences
>
> Hello everyone,
>
> I'm having trouble performing a task that is probably very simple, but
> can't seem to figure out how to get my code to work. What I want to do is
> use the sample function to pick records within a dataframe, but only if
> a column attribute value is repeated more than 3 times. So if you look at
> the data below I have created a unique attribute value that corresponds to
> every site by year combination (i.e. IDbyYear). So you can see that the
> site called "A-Airport" was sampled 6 times in 2006, and "A-Bark Corral
> East" was sampled twice in 2008. So what I want to do is randomly select 3
> records for "A-Airport" in 2006 from the existing 6 records, but for "A-Bark
> Corral East" in 2008 I just want to leave these records as they currently
> are.
>
> I've used the following code to try to accomplish this, but like I said I
> can't get it to work so I'm clearly doing something wrong. If you could
> check out the code and provide any suggestions that would be great. It
> should be noted that there are 5589 unique IDbyYear combinations so that's
> why that number is in the code. If any further clarification is needed also
> let me know.
>
> boom=data.frame()
> for (i in 1:5589){
>   boom[i,]=ifelse(length(fitting_set$IDbyYear[i]>3),fitting_set[sample(nrow(fitting_set),3),],fitting_set)
> }
> boom
>
>
>  IDbyYear   SiteID               Year    (6 other column attributes)
>  42.24      A-Airport            2006
>  42.24      A-Airport            2006
>  42.24      A-Airport            2006
>  42.24      A-Airport            2006
>  42.24      A-Airport            2006
>  42.24      A-Airport            2006
>  45.32      A-Bark Corral East   2008
>  45.32      A-Bark Corral East   2008
>  45.36      A-Bark Corral East   2009
>  45.40      A-Bark Corral East   2010
>  45.40      A-Bark Corral East   2010
>
>  Thanks
>
>
> --
> Curtis Burkhalter
>
> https://sites.google.com/site/curtisburkhalter/



-- 
Curtis Burkhalter

https://sites.google.com/site/curtisburkhalter/
