[R] Binning question (binning rows of a data.frame according to a variable)
Dan Bolser
dmb at mrc-dunn.cam.ac.uk
Mon Mar 20 16:11:01 CET 2006
Adaikalavan Ramasamy wrote:
> Are you saying that your data might look like this ?
>
> set.seed(1) # For reproducibility only - remove this
> mydf <- data.frame( age=round(runif(100, min=5, max=65), digits=1),
> nred=rpois(100, lambda=10),
> nblue=rpois(100, lambda=5),
> ngreen=rpois(100, lambda=15) )
> mydf$total <- rowSums( mydf[ , c("nred", "nblue", "ngreen")] )
>
> head(mydf)
> age nred nblue ngreen total
> 1 20.9 11 7 15 33
> 2 27.3 8 2 18 28
> 3 39.4 11 4 8 23
> 4 59.5 6 5 8 19
> 5 17.1 10 3 16 29
> 6 58.9 11 5 14 30
>
>
> If so, then try this :
>
> mydf <- mydf[order(mydf$age), ] ## re-order by age
> mydf$cumtotal <- cumsum(mydf$total) ## cummulative total
>
> brk.pts <- seq(from=0, to=sum(mydf$total), len=9)
> mydf$grp <- cut( mydf$cumtotal , brk.pts, labels=F )
>
> age nred nblue ngreen total cumtotal grp
> 27 5.8 9 5 8 22 22 1
> 47 6.4 6 5 13 24 46 1
> 92 8.5 8 4 18 30 76 1
> 10 8.7 12 5 8 25 101 1
> 55 9.2 10 7 13 30 131 1
> 69 10.1 9 3 18 30 161 1
>
>
> So here your 'grp' column is what you really want. Just to check
>
> tapply( mydf$total, mydf$grp, sum )
> 1 2 3 4 5 6 7 8
> 352 363 372 387 358 377 377 370
>
> sapply( tapply( mydf$age, mydf$grp, range ), c )
> 1 2 3 4 5 6 7 8
> [1,] 5.8 17.1 24.5 29.0 34.6 44.6 51.2 56.7
> [2,] 16.2 24.0 28.4 33.9 44.1 51.0 55.4 64.5
>
> The last command says that your youngest student in group 1 is aged 5.8
> and oldest is aged 16.2.
>
>
> Taking this one step further, you can calculate the proportion of the
> red, green and blue for each of the 8 groups.
>
> props <- mydf[ , c("nred", "nblue", "ngreen")]/mydf$total # proportions
> apply( props, 2, function(v) tapply( v, mydf$grp, mean ) )
> nred nblue ngreen
> 1 0.3459898 0.1776441 0.4763661
> 2 0.3280712 0.1730796 0.4988492
> 3 0.3061429 0.1748149 0.5190422
> 4 0.3759380 0.2084694 0.4155926
> 5 0.3548805 0.1587353 0.4863842
> 6 0.3106835 0.1829349 0.5063816
> 7 0.3525933 0.1599737 0.4874330
> 8 0.3133796 0.1795567 0.5070637
>
> Hope this of some use.
Yes, this is very useful! I have just one remaining question, above you
take the mean of the group proportion...
apply( props, 2, function(v) tapply( v, mydf$grp, mean ) )
instead of explicitly recalculating the proportion for the group (what I
couldn't script real good) ...
rbind(
colSums(mydf[ mydf$grp==1, c("nred", "nblue", "ngreen")])/
sum (mydf[ mydf$grp==1, c("nred", "nblue", "ngreen")]),
...
colSums(mydf[ mydf$grp==8, c("nred", "nblue", "ngreen")])/
sum (mydf[ mydf$grp==8, c("nred", "nblue", "ngreen")])
)
Giving (from the same seed)...
nred nblue ngreen
[1,] 0.3465909 0.1704545 0.4829545
[2,] 0.3250689 0.1735537 0.5013774
[3,] 0.3064516 0.1774194 0.5161290
[4,] 0.3746770 0.2067183 0.4186047
[5,] 0.3519553 0.1564246 0.4916201
[6,] 0.3103448 0.1830239 0.5066313
[7,] 0.3501326 0.1644562 0.4854111
[8,] 0.3081081 0.1837838 0.5081081
Which is *slightly* different from the 'mean' approach.
> round(former-latter,4)
nred nblue ngreen
1 -0.0006 0.0072 -0.0066
2 0.0030 -0.0005 -0.0025
3 -0.0003 -0.0026 0.0029
4 0.0013 0.0018 -0.0030
5 0.0029 0.0023 -0.0052
6 0.0003 -0.0001 -0.0002
7 0.0025 -0.0045 0.0020
8 0.0053 -0.0042 -0.0010
I know this less a question about R, and more a question about general
stats, but why did you choose the former and not the latter method? Is
one wrong and one right? Or did the former better fit the situation as
described?
Thanks for any insight into your decision, as this is something that has
always puzzled me.
Thanks for the beautifully clear examples!
Dan.
>
> Regards, Adai
>
>
>
> On Sun, 2006-03-19 at 18:58 +0000, Dan Bolser wrote:
>
>>Adaikalavan Ramasamy wrote:
>>
>>>Do you by any chance want to sample from each group equally to get an
>>>equal representation matrix ?
>>
>>No.
>>
>>I want to make groups of equal sizes, where size isn't simply number of
>>rows (allowing a simple 'gl'), but a sum of the variable.
>>
>>Thanks for the code though, it looks useful.
>>
>>
>>
>>Here is an analogy for what I want to do (in case it helps).
>>
>>A group of students have some bags of marbles - The marbles have
>>different colours. Each student has one bag, but can have between 5 and
>>50 marbles per bag with any given strange distribution you like. I line
>>the students up by age, and want to see if there is any systematic
>>difference between the number of each color of marble by age (older
>>students may find primary colours less 'cool').
>>
>>Because the statistics of each individual student are bad (like the
>>proportion of each color per student -- has a high variance) I first put
>>all the students into 8 groups (for example).
>>
>>Thing is, for one reason or another, the number of marbles per bag may
>>systematically vary with age too. However, I am not interested in the
>>number of marbles per bag, so I would like to group the students into 8
>>groups such that each group has the same total number of marbles. (Each
>>group having a different sized age range, none the less ordered by age).
>>
>>Then I can look at the proportion (or count) of colours in each group,
>>and I can compare the groups or any trend accross the groups.
>>
>>Does that make sense?
>>
>>Cheers,
>>Dan.
>>
>>
>>
>>
>>
>>
>>
>>>Here is an example of the input :
>>>
>>> mydf <- data.frame( value=1:100, value2=rnorm(100),
>>> grp=rep( LETTERS[1:4], c(35, 15, 30, 20) ) )
>>>
>>>which has 35 observations from A, 15 from B, 30 from C and 20 from D.
>>>
>>>
>>>And here is a function that I wrote:
>>>
>>> sample.by.group <- function(df, grp, k, replace=FALSE){
>>>
>>> if(length(k)==1){ k <- rep(k, length(unique(grp))) }
>>>
>>> if(!replace && any(k > table(grp)))
>>> stop( paste("Cannot take a sample larger than the population when
>>> 'replace = FALSE'.\n", "Please specify a value greater than",
>>> min(table(grp)), "or use 'replace = TRUE'.\n") )
>>>
>>>
>>> ind <- model.matrix( ~ -1 + grp )
>>> w.mat <- list(NULL)
>>>
>>> for(i in 1:ncol(ind)){
>>> w.mat[[i]] <- sample( which( ind[,i]==1 ), k[i], replace=replace )
>>> }
>>>
>>> out <- df[ unlist(w.mat), ]
>>> return(out)
>>> }
>>>
>>>
>>>And here are some examples of how to use it :
>>>
>>>mydf <- mydf[ sample(1:nrow(mydf)), ] # scramble it for fun
>>>
>>>
>>>out1 <- sample.by.group(mydf, mydf$grp, k=10 )
>>>table( out1$grp )
>>>
>>> out2 <- sample.by.group(mydf, mydf$grp, k=50, replace=T) # ie bootstrap
>>> table( out2$grp )
>>>
>>>and you can even do bootstrapping or sampling with weights via:
>>>
>>> out3 <- sample.by.group(mydf, mydf$grp, k=c(20, 20, 30, 30), replace=T)
>>> table( out3$grp )
>>>
>>>
>>>Regards, Adai
>>>
>>>
>>>
>>>On Fri, 2006-03-17 at 16:01 +0000, Dan Bolser wrote:
>>>
>>>
>>>>Hi,
>>>>
>>>>I have tuples of data in rows of a data.frame, each column is a variable
>>>>for the 'items' (one per row).
>>>>
>>>>One of the variables is the 'size' of the item (row).
>>>>
>>>>I would like to cut my data.frame into groups such that each group has
>>>>the same *total size*. So, assuming that we order by size, some groups
>>>>should have several small items while other groups have a few large
>>>>items. All the groups should have approximately the same total size.
>>>>
>>>>I have tried various combinations of cut, quantile, and ecdf, and I just
>>>>can't work out how to do this!
>>>>
>>>>Any help is greatly appreciated!
>>>>
>>>>All the best,
>>>>Dan.
>>>>
>>>>______________________________________________
>>>>R-help at stat.math.ethz.ch mailing list
>>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>>PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>>>>
>>>
>>>
>>
>
More information about the R-help
mailing list