# [R] Binning question (binning rows of a data.frame according to a variable)

Mon Mar 20 13:40:40 CET 2006

```Are you saying that your data might look like this ?

set.seed(1)  # For reproducibility only - remove this
mydf <- data.frame( age=round(runif(100, min=5, max=65), digits=1),
nred=rpois(100, lambda=10),
nblue=rpois(100, lambda=5),
ngreen=rpois(100, lambda=15) )
mydf\$total <- rowSums( mydf[ , c("nred", "nblue", "ngreen")] )

age nred nblue ngreen total
1 20.9   11     7     15    33
2 27.3    8     2     18    28
3 39.4   11     4      8    23
4 59.5    6     5      8    19
5 17.1   10     3     16    29
6 58.9   11     5     14    30

If so, then try this :

mydf          <- mydf[order(mydf\$age), ]  ## re-order by age
mydf\$cumtotal <- cumsum(mydf\$total)       ## cummulative total

brk.pts       <- seq(from=0, to=sum(mydf\$total), len=9)
mydf\$grp      <- cut( mydf\$cumtotal , brk.pts, labels=F )

age nred nblue ngreen total cumtotal grp
27  5.8    9     5      8    22       22   1
47  6.4    6     5     13    24       46   1
92  8.5    8     4     18    30       76   1
10  8.7   12     5      8    25      101   1
55  9.2   10     7     13    30      131   1
69 10.1    9     3     18    30      161   1

So here your 'grp' column is what you really want. Just to check

tapply( mydf\$total, mydf\$grp, sum )
1   2   3   4   5   6   7   8
352 363 372 387 358 377 377 370

sapply( tapply( mydf\$age, mydf\$grp, range ), c )
1    2    3    4    5    6    7    8
[1,]  5.8 17.1 24.5 29.0 34.6 44.6 51.2 56.7
[2,] 16.2 24.0 28.4 33.9 44.1 51.0 55.4 64.5

The last command says that your youngest student in group 1 is aged 5.8
and oldest is aged 16.2.

Taking this one step further, you can calculate the proportion of the
red, green and blue for each of the 8 groups.

props <- mydf[ , c("nred", "nblue", "ngreen")]/mydf\$total # proportions
apply( props, 2, function(v) tapply( v, mydf\$grp, mean ) )
nred     nblue    ngreen
1 0.3459898 0.1776441 0.4763661
2 0.3280712 0.1730796 0.4988492
3 0.3061429 0.1748149 0.5190422
4 0.3759380 0.2084694 0.4155926
5 0.3548805 0.1587353 0.4863842
6 0.3106835 0.1829349 0.5063816
7 0.3525933 0.1599737 0.4874330
8 0.3133796 0.1795567 0.5070637

Hope this of some use.

On Sun, 2006-03-19 at 18:58 +0000, Dan Bolser wrote:
> > Do you by any chance want to sample from each group equally to get an
> > equal representation matrix ?
>
> No.
>
> I want to make groups of equal sizes, where size isn't simply number of
> rows (allowing a simple 'gl'), but a sum of the variable.
>
> Thanks for the code though, it looks useful.
>
>
>
> Here is an analogy for what I want to do (in case it helps).
>
> A group of students have some bags of marbles - The marbles have
> different colours. Each student has one bag, but can have between 5 and
> 50 marbles per bag with any given strange distribution you like. I line
> the students up by age, and want to see if there is any systematic
> difference between the number of each color of marble by age (older
> students may find primary colours less 'cool').
>
> Because the statistics of each individual student are bad (like the
> proportion of each color per student -- has a high variance) I first put
> all the students into 8 groups (for example).
>
> Thing is, for one reason or another, the number of marbles per bag may
> systematically vary with age too. However, I am not interested in the
> number of marbles per bag, so I would like to group the students into 8
> groups such that each group has the same total number of marbles. (Each
> group having a different sized age range, none the less ordered by age).
>
> Then I can look at the proportion (or count) of colours in each group,
> and I can compare the groups or any trend accross the groups.
>
> Does that make sense?
>
> Cheers,
> Dan.
>
>
>
>
>
>
> > Here is an example of the input :
> >
> >  mydf <- data.frame( value=1:100, value2=rnorm(100),
> >                      grp=rep( LETTERS[1:4], c(35, 15, 30, 20) ) )
> >
> > which has 35 observations from A, 15 from B, 30 from C and 20 from D.
> >
> >
> > And here is a function that I wrote:
> >
> >  sample.by.group <- function(df, grp, k, replace=FALSE){
> >
> >    if(length(k)==1){ k <- rep(k, length(unique(grp))) }
> >
> >    if(!replace && any(k > table(grp)))
> >      stop( paste("Cannot take a sample larger than the population when
> >      'replace = FALSE'.\n", "Please specify a value greater than",
> >      min(table(grp)), "or use 'replace = TRUE'.\n") )
> >
> >
> >    ind   <- model.matrix( ~ -1 + grp )
> >    w.mat <- list(NULL)
> >
> >    for(i in 1:ncol(ind)){
> >      w.mat[[i]] <- sample( which( ind[,i]==1 ), k[i], replace=replace )
> >    }
> >
> >    out <- df[ unlist(w.mat), ]
> >    return(out)
> >  }
> >
> >
> > And here are some examples of how to use it :
> >
> > mydf <- mydf[ sample(1:nrow(mydf)), ]   # scramble it for fun
> >
> >
> > out1 <- sample.by.group(mydf, mydf\$grp, k=10 )
> > table( out1\$grp )
> >
> >  out2 <- sample.by.group(mydf, mydf\$grp, k=50, replace=T) # ie bootstrap
> >  table( out2\$grp )
> >
> > and you can even do bootstrapping or sampling with weights via:
> >
> >  out3 <- sample.by.group(mydf, mydf\$grp, k=c(20, 20, 30, 30), replace=T)
> >  table( out3\$grp )
> >
> >
> >
> >
> >
> > On Fri, 2006-03-17 at 16:01 +0000, Dan Bolser wrote:
> >
> >>Hi,
> >>
> >>I have tuples of data in rows of a data.frame, each column is a variable
> >>for the 'items' (one per row).
> >>
> >>One of the variables is the 'size' of the item (row).
> >>
> >>I would like to cut my data.frame into groups such that each group has
> >>the same *total size*. So, assuming that we order by size, some groups
> >>should have several small items while other groups have a few large
> >>items. All the groups should have approximately the same total size.
> >>
> >>I have tried various combinations of cut, quantile, and ecdf, and I just
> >>can't work out how to do this!
> >>
> >>Any help is greatly appreciated!
> >>
> >>All the best,
> >>Dan.
> >>
> >>______________________________________________
> >>R-help at stat.math.ethz.ch mailing list
> >>https://stat.ethz.ch/mailman/listinfo/r-help