[R] Subsetting by number of observations in a factor

Fri Aug 10 04:35:22 CEST 2007

Does this do what you want?  It creates a new dataframe containing only
those 'mg' groups that have at least a certain number of observations.

> set.seed(2)
> # create some test data
> x <- data.frame(mg=sample(LETTERS[1:4], 20, TRUE), data=1:20)
> # split the data into subsets based on 'mg'
> x.split <- split(x, x$mg)
> str(x.split)
List of 4
 $ A:'data.frame':      7 obs. of  2 variables:
  ..$ mg  : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1
  ..$ data: int [1:7] 1 4 7 12 14 18 20
 $ B:'data.frame':      3 obs. of  2 variables:
  ..$ mg  : Factor w/ 4 levels "A","B","C","D": 2 2 2
  ..$ data: int [1:3] 9 15 19
 $ C:'data.frame':      4 obs. of  2 variables:
  ..$ mg  : Factor w/ 4 levels "A","B","C","D": 3 3 3 3
  ..$ data: int [1:4] 2 3 10 11
 $ D:'data.frame':      6 obs. of  2 variables:
  ..$ mg  : Factor w/ 4 levels "A","B","C","D": 4 4 4 4 4 4
  ..$ data: int [1:6] 5 6 8 13 16 17
> # only choose subsets with at least 5 observations
> x.5 <- lapply(x.split, function(a) {
+     if (nrow(a) >= 5) return(a)
+     else return(NULL)
+ })
> # create new dataframe with these observations (rbind drops the NULL entries)
> x.new <- do.call('rbind', x.5)
> x.new
     mg data
A.1   A    1
A.4   A    4
A.7   A    7
A.12  A   12
A.14  A   14
A.18  A   18
A.20  A   20
D.5   D    5
D.6   D    6
D.8   D    8
D.13  D   13
D.16  D   16
D.17  D   17
>
>
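
A shorter route, closer to the table/tapply idea in the question below, might look like the following. This is only a sketch: the names 'n', 'x.ave', 'x.tab' and 'keep' are made up for illustration ('nmg' is the column name suggested in the question), and the test data is typed out to match the group sizes shown above rather than sampled, since sample() can return different values under the same seed in other R versions.

```r
# Test data laid out to match the groups shown above (A = 7, B = 3, C = 4, D = 6 rows).
mg <- c("A", "C", "C", "A", "D", "D", "A", "D", "B", "C",
        "C", "A", "D", "A", "B", "D", "D", "A", "B", "A")
x <- data.frame(mg = factor(mg), data = 1:20)

n <- 5  # illustrative threshold: minimum number of observations per group

# Option 1: add a per-row group count ('nmg'), then subset on it.
x$nmg <- ave(seq_along(x$mg), x$mg, FUN = length)
x.ave <- x[x$nmg >= n, ]

# Option 2: count with table() and keep rows whose group is large enough.
counts <- table(x$mg)
keep <- names(counts)[counts >= n]
x.tab <- x[x$mg %in% keep, c("mg", "data")]
```

Both versions keep the rows in their original order and with their original row names, whereas the split/rbind approach groups the rows by 'mg'. The factor still carries all four levels afterwards; droplevels(x.ave) would remove the unused ones if that matters.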

On 8/9/07, Ron Crump <ron.crump at une.edu.au> wrote:
> Hi,
>
> I generally do my data preparation externally to R, so
> this is a bit unfamiliar to me, but a colleague has asked
> me how to do certain data manipulations within R.
>
> Anyway, basically I can get his large file into a dataframe.
> One of the columns is a management group code (mg). There may be
> varying numbers of observations per management group, and
> he would like to subset the dataframe such that there are
> always at least n per management group.
>
> I presume I can get to this using table or tapply, then
> (and I'm not sure how on this bit) creating a column nmg
> containing the number of observations that corresponds to
> mg for that row, then simply subsetting.
>
> So, am I on the right track? If so, how do I actually do it, and
> is there an easier method than the one I am considering?
>
> Thanks for your help,
> Ron

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?