[R] Binning question (binning rows of a data.frame according to a variable)

Adaikalavan Ramasamy ramasamy at cancer.org.uk
Mon Mar 20 13:40:40 CET 2006


Are you saying that your data might look like this ?

 set.seed(1)  # For reproducibility only - remove this
 mydf <- data.frame( age=round(runif(100, min=5, max=65), digits=1),
                     nred=rpois(100, lambda=10), 
                     nblue=rpois(100, lambda=5), 
                     ngreen=rpois(100, lambda=15) )
 mydf$total <- rowSums( mydf[ , c("nred", "nblue", "ngreen")] )

 head(mydf)
    age nred nblue ngreen total
 1 20.9   11     7     15    33
 2 27.3    8     2     18    28
 3 39.4   11     4      8    23
 4 59.5    6     5      8    19
 5 17.1   10     3     16    29
 6 58.9   11     5     14    30


If so, then try this :

 mydf          <- mydf[order(mydf$age), ]  ## re-order by age
 mydf$cumtotal <- cumsum(mydf$total)       ## cumulative total

 brk.pts       <- seq(from=0, to=sum(mydf$total), length.out=9)  ## 9 break points = 8 groups
 mydf$grp      <- cut( mydf$cumtotal , brk.pts, labels=FALSE )

 head(mydf)
     age nred nblue ngreen total cumtotal grp
 27  5.8    9     5      8    22       22   1
 47  6.4    6     5     13    24       46   1
 92  8.5    8     4     18    30       76   1
 10  8.7   12     5      8    25      101   1
 55  9.2   10     7     13    30      131   1
 69 10.1    9     3     18    30      161   1


So here your 'grp' column is what you really want. Just to check:

 tapply( mydf$total, mydf$grp, sum )
   1   2   3   4   5   6   7   8 
 352 363 372 387 358 377 377 370 

 sapply( tapply( mydf$age, mydf$grp, range ), c )
         1    2    3    4    5    6    7    8
 [1,]  5.8 17.1 24.5 29.0 34.6 44.6 51.2 56.7
 [2,] 16.2 24.0 28.4 33.9 44.1 51.0 55.4 64.5

The last command shows that the youngest student in group 1 is aged 5.8
and the oldest is aged 16.2.
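
If you need this more than once, the steps above can be wrapped into a
small helper. This is only a sketch: the name bin.by.cumsum and its
arguments are made up for illustration (they are not from any package),
and it assumes every size is positive so that the cumulative total is
strictly increasing. It returns the group numbers in the original row
order, so you do not have to re-order the data frame first.

```r
# Hypothetical helper (name and arguments are illustrative only).
# Assigns each element to one of 'n.groups' bins so that the bins have
# roughly equal sums of 'size', after ordering the elements by 'key'.
bin.by.cumsum <- function(size, key, n.groups = 8) {
  stopifnot(length(size) == length(key), all(size > 0))
  ord      <- order(key)                    # positions sorted by key
  cum      <- cumsum(size[ord])             # running total in key order
  brk.pts  <- seq(from = 0, to = sum(size), length.out = n.groups + 1)
  grp      <- integer(length(size))
  grp[ord] <- cut(cum, brk.pts, labels = FALSE)  # back to original order
  grp
}

# Simulated data, for illustration only
set.seed(1)
age   <- round(runif(100, min = 5, max = 65), digits = 1)
total <- rpois(100, lambda = 30)

grp <- bin.by.cumsum(total, age, n.groups = 8)
tapply(total, grp, sum)    # group totals should be roughly equal
```

The group totals cannot be made exactly equal in general, because whole
rows are assigned to groups; each total can differ from the ideal
sum(size)/n.groups by up to about one row's size.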


Taking this one step further, you can calculate the mean per-student
proportion of red, blue and green marbles for each of the 8 groups.

 props <- mydf[ , c("nred", "nblue", "ngreen")]/mydf$total # proportions
 apply( props, 2, function(v) tapply( v, mydf$grp, mean ) )
        nred     nblue    ngreen
 1 0.3459898 0.1776441 0.4763661
 2 0.3280712 0.1730796 0.4988492
 3 0.3061429 0.1748149 0.5190422
 4 0.3759380 0.2084694 0.4155926
 5 0.3548805 0.1587353 0.4863842
 6 0.3106835 0.1829349 0.5063816
 7 0.3525933 0.1599737 0.4874330
 8 0.3133796 0.1795567 0.5070637
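
Note that the table above averages each student's own proportions within
a group. If you would rather have the proportion of each colour among
all the marbles in a group (that is, pooling the bags), something along
these lines may be closer to what you want. It is only a sketch using
rowsum() and prop.table() from base R, and it rebuilds the simulated
data so that it runs on its own.

```r
# Simulated data, for illustration only (same recipe as above)
set.seed(1)
mydf <- data.frame(age    = round(runif(100, min = 5, max = 65), digits = 1),
                   nred   = rpois(100, lambda = 10),
                   nblue  = rpois(100, lambda = 5),
                   ngreen = rpois(100, lambda = 15))
mydf$total <- rowSums(mydf[ , c("nred", "nblue", "ngreen")])

# Order by age and cut the cumulative total into 8 equal-width intervals
mydf     <- mydf[order(mydf$age), ]
brk.pts  <- seq(from = 0, to = sum(mydf$total), length.out = 9)
mydf$grp <- cut(cumsum(mydf$total), brk.pts, labels = FALSE)

# Sum the counts of each colour within each group ...
counts <- rowsum(mydf[ , c("nred", "nblue", "ngreen")], group = mydf$grp)
# ... then divide each row by its group total to get pooled proportions
pooled <- prop.table(as.matrix(counts), margin = 1)
round(pooled, 3)
```

Because the groups hold nearly the same number of marbles, the pooled
proportions are all estimated from about the same amount of data, which
makes comparisons across groups easier.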

Hope this is of some use.

Regards, Adai



On Sun, 2006-03-19 at 18:58 +0000, Dan Bolser wrote:
> Adaikalavan Ramasamy wrote:
> > Do you by any chance want to sample from each group equally to get an
> > equal representation matrix ? 
> 
> No.
> 
> I want to make groups of equal sizes, where size isn't simply number of 
> rows (allowing a simple 'gl'), but a sum of the variable.
> 
> Thanks for the code though, it looks useful.
> 
> 
> 
> Here is an analogy for what I want to do (in case it helps).
> 
> A group of students have some bags of marbles - The marbles have 
> different colours. Each student has one bag, but can have between 5 and 
> 50 marbles per bag with any given strange distribution you like. I line 
> the students up by age, and want to see if there is any systematic 
> difference between the number of each color of marble by age (older 
> students may find primary colours less 'cool').
> 
> Because the statistics of each individual student are bad (like the 
> proportion of each color per student -- has a high variance) I first put 
> all the students into 8 groups (for example).
> 
> Thing is, for one reason or another, the number of marbles per bag may 
> systematically vary with age too. However, I am not interested in the 
> number of marbles per bag, so I would like to group the students into 8 
> groups such that each group has the same total number of marbles. (Each 
> group having a different sized age range, none the less ordered by age).
> 
> Then I can look at the proportion (or count) of colours in each group, 
> and I can compare the groups or any trend across the groups.
> 
> Does that make sense?
> 
> Cheers,
> Dan.
> 
> 
> 
> 
> 
> 
> > Here is an example of the input :
> > 
> >  mydf <- data.frame( value=1:100, value2=rnorm(100),
> >                      grp=rep( LETTERS[1:4], c(35, 15, 30, 20) ) )
> > 
> > which has 35 observations from A, 15 from B, 30 from C and 20 from D.
> > 
> > 
> > And here is a function that I wrote:
> > 
> >  sample.by.group <- function(df, grp, k, replace=FALSE){
> > 
> >    if(length(k)==1){ k <- rep(k, length(unique(grp))) }
> >     
> >    if(!replace && any(k > table(grp)))
> >      stop( paste("Cannot take a sample larger than the population when",
> >      "'replace = FALSE'.\n", "Please specify a value no greater than",
> >      min(table(grp)), "or use 'replace = TRUE'.\n") )
> > 
> >   
> >    ind   <- model.matrix( ~ -1 + grp )
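> >    w.mat <- vector("list", ncol(ind))
> >    
> >    for(i in seq_len(ncol(ind))){
> >      idx        <- which( ind[,i]==1 )   # rows belonging to group i
> >      ## index into idx so that a group with a single row is handled correctly
> >      w.mat[[i]] <- idx[ sample.int( length(idx), k[i], replace=replace ) ]
> >    }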
> >   
> >    out <- df[ unlist(w.mat), ]
> >    return(out)
> >  }
> > 
> > 
> > And here are some examples of how to use it :
> >  
> > mydf <- mydf[ sample(1:nrow(mydf)), ]   # scramble it for fun
> > 
> > 
> > out1 <- sample.by.group(mydf, mydf$grp, k=10 )
> > table( out1$grp )
> > 
> >  out2 <- sample.by.group(mydf, mydf$grp, k=50, replace=TRUE) # ie bootstrap
> >  table( out2$grp )
> > 
> > and you can even bootstrap with a different sample size per group via:
> > 
> >  out3 <- sample.by.group(mydf, mydf$grp, k=c(20, 20, 30, 30), replace=TRUE)
> >  table( out3$grp )
> > 
> > 
> > Regards, Adai
> > 
> > 
> > 
> > On Fri, 2006-03-17 at 16:01 +0000, Dan Bolser wrote:
> > 
> >>Hi,
> >>
> >>I have tuples of data in rows of a data.frame, each column is a variable 
> >>for the 'items' (one per row).
> >>
> >>One of the variables is the 'size' of the item (row).
> >>
> >>I would like to cut my data.frame into groups such that each group has 
> >>the same *total size*. So, assuming that we order by size, some groups 
> >>should have several small items while other groups have a few large 
> >>items. All the groups should have approximately the same total size.
> >>
> >>I have tried various combinations of cut, quantile, and ecdf, and I just 
> >>can't work out how to do this!
> >>
> >>Any help is greatly appreciated!
> >>
> >>All the best,
> >>Dan.
> >>
> >>______________________________________________
> >>R-help at stat.math.ethz.ch mailing list
> >>https://stat.ethz.ch/mailman/listinfo/r-help
> >>PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
> >>
> > 
> > 
> 
>



