[R] Binning question (binning rows of a data.frame according to a variable)
Adaikalavan Ramasamy
ramasamy at cancer.org.uk
Mon Mar 20 13:40:40 CET 2006
Are you saying that your data might look like this?
set.seed(1) # For reproducibility only - remove this
mydf <- data.frame( age=round(runif(100, min=5, max=65), digits=1),
                    nred=rpois(100, lambda=10),
                    nblue=rpois(100, lambda=5),
                    ngreen=rpois(100, lambda=15) )
mydf$total <- rowSums( mydf[ , c("nred", "nblue", "ngreen")] )
head(mydf)
   age nred nblue ngreen total
1 20.9   11     7     15    33
2 27.3    8     2     18    28
3 39.4   11     4      8    23
4 59.5    6     5      8    19
5 17.1   10     3     16    29
6 58.9   11     5     14    30
If so, then try this:
mydf <- mydf[order(mydf$age), ]           ## re-order by age
mydf$cumtotal <- cumsum(mydf$total)       ## cumulative total
brk.pts <- seq(from=0, to=sum(mydf$total), length.out=9)   ## 9 breaks = 8 groups
mydf$grp <- cut( mydf$cumtotal, brk.pts, labels=FALSE )
head(mydf)
    age nred nblue ngreen total cumtotal grp
27  5.8    9     5      8    22       22   1
47  6.4    6     5     13    24       46   1
92  8.5    8     4     18    30       76   1
10  8.7   12     5      8    25      101   1
55  9.2   10     7     13    30      131   1
69 10.1    9     3     18    30      161   1
So the 'grp' column is what you really want. As a check, here are the
total number of marbles and the age range in each group:
tapply( mydf$total, mydf$grp, sum )
  1   2   3   4   5   6   7   8
352 363 372 387 358 377 377 370
sapply( tapply( mydf$age, mydf$grp, range ), c )
        1    2    3    4    5    6    7    8
[1,]  5.8 17.1 24.5 29.0 34.6 44.6 51.2 56.7
[2,] 16.2 24.0 28.4 33.9 44.1 51.0 55.4 64.5
The last command shows that the youngest student in group 1 is aged 5.8
and the oldest is aged 16.2.
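The steps above can be wrapped into a small helper. This is only a sketch:
the name equal.sum.groups and its arguments are made up here for
illustration (they do not come from any package), and it assumes
non-negative sizes with a positive total. It returns the group index in the
original row order, so the data.frame does not have to be re-sorted first.

```r
## Hypothetical helper (not part of the original post or of any package):
## assign rows to 'n.grp' groups of roughly equal total 'size',
## with groups ordered by 'key' (e.g. age).
equal.sum.groups <- function(size, key, n.grp = 8) {
  ord <- order(key)                    # positions of rows when sorted by key
  cum <- cumsum(size[ord])             # running total along the sorted rows
  brk <- seq(from = 0, to = cum[length(cum)], length.out = n.grp + 1)
  grp <- integer(length(size))
  ## include.lowest = TRUE keeps a leading row of size 0 in group 1
  grp[ord] <- cut(cum, brk, labels = FALSE, include.lowest = TRUE)
  grp
}

## Example with simulated data (assumed values, for illustration only)
set.seed(1)
age   <- round(runif(100, min = 5, max = 65), digits = 1)
total <- rpois(100, lambda = 30)
g <- equal.sum.groups(total, age)
tapply(total, g, sum)                  # group totals, roughly equal
```

Because a row cannot be split between two groups, the group totals will
only match the target to within roughly one row's size, which is why the
sums above are close but not identical.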
Taking this one step further, you can calculate the mean per-student
proportion of red, blue and green marbles for each of the 8 groups.
props <- mydf[ , c("nred", "nblue", "ngreen")]/mydf$total # proportions
apply( props, 2, function(v) tapply( v, mydf$grp, mean ) )
       nred     nblue    ngreen
1 0.3459898 0.1776441 0.4763661
2 0.3280712 0.1730796 0.4988492
3 0.3061429 0.1748149 0.5190422
4 0.3759380 0.2084694 0.4155926
5 0.3548805 0.1587353 0.4863842
6 0.3106835 0.1829349 0.5063816
7 0.3525933 0.1599737 0.4874330
8 0.3133796 0.1795567 0.5070637
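Note that the table above averages each student's own proportions within a
group. If you would rather pool the marbles in each group first and then
take proportions (which may be closer to what the equal-total grouping is
for), something like the following should work. It is a sketch that
rebuilds the same simulated data so that it runs on its own; the names
'counts' and 'pooled' are just illustrative.

```r
## Rebuild the simulated data and the grouping from the steps above
set.seed(1)
mydf <- data.frame( age=round(runif(100, min=5, max=65), digits=1),
                    nred=rpois(100, lambda=10),
                    nblue=rpois(100, lambda=5),
                    ngreen=rpois(100, lambda=15) )
cols <- c("nred", "nblue", "ngreen")
mydf$total <- rowSums( mydf[ , cols] )
mydf <- mydf[order(mydf$age), ]
brk.pts <- seq(from=0, to=sum(mydf$total), length.out=9)
mydf$grp <- cut( cumsum(mydf$total), brk.pts, labels=FALSE )

## Sum the marble counts of each colour within each group ...
counts <- rowsum( as.matrix(mydf[ , cols]), mydf$grp )
## ... then divide each row by its total to get pooled proportions
pooled <- prop.table( counts, margin=1 )
pooled
```

The two versions weight students differently: the per-student mean gives
every student equal weight, whereas the pooled version gives every marble
equal weight.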
Hope this is of some use.
Regards, Adai
On Sun, 2006-03-19 at 18:58 +0000, Dan Bolser wrote:
> Adaikalavan Ramasamy wrote:
> > Do you by any chance want to sample from each group equally to get an
> > equal representation matrix ?
>
> No.
>
> I want to make groups of equal sizes, where size isn't simply number of
> rows (allowing a simple 'gl'), but a sum of the variable.
>
> Thanks for the code though, it looks useful.
>
>
>
> Here is an analogy for what I want to do (in case it helps).
>
> A group of students have some bags of marbles - The marbles have
> different colours. Each student has one bag, but can have between 5 and
> 50 marbles per bag with any given strange distribution you like. I line
> the students up by age, and want to see if there is any systematic
> difference between the number of each color of marble by age (older
> students may find primary colours less 'cool').
>
> Because the statistics for each individual student are poor (the
> proportion of each color per student has a high variance), I first put
> all the students into 8 groups (for example).
>
> Thing is, for one reason or another, the number of marbles per bag may
> systematically vary with age too. However, I am not interested in the
> number of marbles per bag, so I would like to group the students into 8
> groups such that each group has the same total number of marbles. (Each
> group having a different-sized age range, but nonetheless ordered by age).
>
> Then I can look at the proportion (or count) of colours in each group,
> and I can compare the groups or any trend across the groups.
>
> Does that make sense?
>
> Cheers,
> Dan.
>
>
>
>
>
>
> > Here is an example of the input :
> >
> > mydf <- data.frame( value=1:100, value2=rnorm(100),
> > grp=rep( LETTERS[1:4], c(35, 15, 30, 20) ) )
> >
> > which has 35 observations from A, 15 from B, 30 from C and 20 from D.
> >
> >
> > And here is a function that I wrote:
> >
> > sample.by.group <- function(df, grp, k, replace=FALSE){
> >
> >   grp <- factor(grp)    # drops unused levels so counts line up with k
> >   n   <- table(grp)     # number of rows available in each group
> >   if(length(k)==1){ k <- rep(k, length(n)) }
> >
> >   if(!replace && any(k > n))
> >     stop("Cannot take a sample larger than the population when ",
> >          "'replace = FALSE'.\n", "Please specify a value no greater than ",
> >          min(n), " or use 'replace = TRUE'.\n")
> >
> >   w.mat <- vector("list", length(n))
> >
> >   for(i in seq_along(n)){
> >     idx <- which( grp == names(n)[i] )    # rows belonging to group i
> >     ## sample positions, not values: sample(idx, ...) would draw from
> >     ## 1:idx when a group has only one row
> >     w.mat[[i]] <- idx[ sample.int( length(idx), k[i], replace=replace ) ]
> >   }
> >
> >   out <- df[ unlist(w.mat), ]
> >   return(out)
> > }
> >
> >
> > And here are some examples of how to use it :
> >
> > mydf <- mydf[ sample(1:nrow(mydf)), ] # scramble it for fun
> >
> >
> > out1 <- sample.by.group(mydf, mydf$grp, k=10 )
> > table( out1$grp )
> >
> > out2 <- sample.by.group(mydf, mydf$grp, k=50, replace=TRUE) # i.e. bootstrap
> > table( out2$grp )
> >
> > and you can even do bootstrapping or sampling with a different sample
> > size for each group via:
> >
> > out3 <- sample.by.group(mydf, mydf$grp, k=c(20, 20, 30, 30), replace=TRUE)
> > table( out3$grp )
> >
> >
> > Regards, Adai
> >
> >
> >
> > On Fri, 2006-03-17 at 16:01 +0000, Dan Bolser wrote:
> >
> >>Hi,
> >>
> >>I have tuples of data in rows of a data.frame, each column is a variable
> >>for the 'items' (one per row).
> >>
> >>One of the variables is the 'size' of the item (row).
> >>
> >>I would like to cut my data.frame into groups such that each group has
> >>the same *total size*. So, assuming that we order by size, some groups
> >>should have several small items while other groups have a few large
> >>items. All the groups should have approximately the same total size.
> >>
> >>I have tried various combinations of cut, quantile, and ecdf, and I just
> >>can't work out how to do this!
> >>
> >>Any help is greatly appreciated!
> >>
> >>All the best,
> >>Dan.
> >>
> >>______________________________________________
> >>R-help at stat.math.ethz.ch mailing list
> >>https://stat.ethz.ch/mailman/listinfo/r-help
> >>PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
> >>
> >
> >
>
>