[R] merging-binning data

Wed Nov 4 15:33:59 CET 2015

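Which approach is "best" for defining subsets depends entirely on the semantics of the data. Your approach (a fixed number of equally spaced breaks) is the right one if the absolute ranges of the data are important. It should be obvious that either the top or the bottom group could contain only a single element, and also that any or all of the intermediate groups could be empty. Note also that with the code I posted, max(dBin) rescales to exactly 1 and therefore always lands alone in group nGroups + 1; clamp the result to nGroups if you want the maximum to be part of the top group.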
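
A minimal sketch of equally spaced binning, assuming a small made-up vector in place of your data (the values and the name `counts` are purely illustrative). It clamps the result with pmin() so that the maximum stays in the top group, and it shows that intermediate groups can indeed come out empty:

```r
# Made-up input, chosen so that most values sit near the minimum
dBin <- c(13.7, 20, 25, 30, 521.2)
nGroups <- 5

# Rescale to the range [0, 1]; the maximum becomes exactly 1
scaled <- (dBin - min(dBin)) / (max(dBin) - min(dBin))

# Discretize; pmin() moves the maximum from group nGroups + 1 into group nGroups
groups <- pmin(floor(scaled * nGroups) + 1, nGroups)

# Count the members of every group, including the empty ones
counts <- table(factor(groups, levels = seq_len(nGroups)))
print(counts)
```

Here the top group holds a single element and groups 2 to 4 are empty, purely because of how the values are spread over the range.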

If you want to control the number of elements in your groups, use quantiles instead. 
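
As a rough sketch of the quantile approach, assuming the first ten values of your vector (rounded) and illustrative names `qBreaks` and `qGroups`: the breaks are taken from quantile() and passed to cut(), so that each group receives about the same number of elements.

```r
# First ten values from the thread, rounded to two decimals
dBin <- c(238.95, 143.09, 88.51, 177.68, 277.54,
          342.95, 241.61, 177.82, 211.26, 279.73)
nGroups <- 5

# Break points at the 0%, 20%, ..., 100% quantiles
qBreaks <- quantile(dBin, probs = seq(0, 1, length.out = nGroups + 1))

# include.lowest = TRUE keeps min(dBin) in the first group;
# labels = FALSE returns integer group codes instead of a factor
qGroups <- cut(dBin, breaks = qBreaks, include.lowest = TRUE, labels = FALSE)

print(table(qGroups))
```

One caveat: if the data contain many tied values, the quantile breaks may not be unique, and cut() will then refuse them.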

Your application may require you to define the breaks in other ways. The code I have given you doesn't generalize well, as it depends on the equal spacing of breaks. As I mentioned earlier, I would not store the groups at all. Instead, I would define a function that returns a vector of the elements in a group, and in the function body I would clearly and explicitly define the conditions for group membership (and comment them). That is how you make code for a task like this explicit and _maintainable_.
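
Something along these lines, as a sketch only: the function name groupMembers(), the break points in `myBreaks`, and the sample values are all hypothetical, and the convention of half-open intervals [lower, upper) with a closed top group is an assumption you should adapt to your data.

```r
# Hypothetical helper: return the elements of x that belong to group i.
# Group i covers the half-open interval [breaks[i], breaks[i + 1]).
# The last group is closed on the right so that the largest break is included.
groupMembers <- function(x, i, breaks) {
  lower <- breaks[i]
  upper <- breaks[i + 1]
  isLastGroup <- (i == length(breaks) - 1)
  belowUpper <- if (isLastGroup) x <= upper else x < upper
  x[x >= lower & belowUpper]
}

dBin <- c(13.7, 88.5, 143.1, 177.7, 238.9, 521.2)  # made-up sample values
myBreaks <- c(13, 110, 223, 400, 522)              # arbitrary, unequal spacing

print(groupMembers(dBin, 2, myBreaks))  # elements in [110, 223)
```

If you need the positions rather than the values, return which(x >= lower & belowUpper) from the function instead.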

Cheers,
Boris

On Nov 4, 2015, at 9:19 AM, Alaios <alaios at yahoo.com> wrote:

> Thanks, everything is solved and I was even able to plot boxplots as needed.
> The only minor issue is that the max element falls in the last category and is the only element there. Perhaps this is due to the way my data look.
> Regards
> Alex
> 
> 
> 
> On Wednesday, November 4, 2015 3:06 PM, Boris Steipe <boris.steipe at utoronto.ca> wrote:
> 
> 
> The breaks are just the min() and max() in your groups. Something like
> 
>   sprintf("[%5.2f,%5.2f]", min(dBin[groups==2]), max(dBin[groups==2]))
> 
> ... should achieve what you need.
> 
> 
> B.
> 
> 
> 
> On Nov 4, 2015, at 8:45 AM, Alaios <alaios at yahoo.com> wrote:
> 
> > You are right.
> > By labels I mean the "categories" or "breaks" that my data fall in.
> > To be part of group 2, for example, a value has to be in the range [110,223). I need to keep those ranges for my plots.
> > 
> > Did I describe it more precisely now?
> > Alex
> > 
> > 
> > 
> > On Wednesday, November 4, 2015 2:09 PM, Boris Steipe <boris.steipe at utoronto.ca> wrote:
> > 
> > 
> > I don't understand: 
> > - where does the "label" come from? (It's not an element of your data that I see.)
> > - what do you want to do with this "label" i.e. how does it need to be associated with the data?
> > 
> > 
> > B.
> > 
> > 
> > 
> > On Nov 4, 2015, at 7:57 AM, Alaios <alaios at yahoo.com> wrote:
> > 
> > > Thanks, it works great and gives me group numbers as integers, so I can use which() to pick out the elements of a group as needed (which(groups == 2)).
> > > 
> > > My question, though, is how to also keep the labels for each group, for example that my first group is [13,206).
> > > 
> > > Regards
> > > Alex
> > > 
> > > 
> > > 
> > > On Wednesday, November 4, 2015 1:00 PM, Boris Steipe <boris.steipe at utoronto.ca> wrote:
> > > 
> > > 
> > > I would transform the original numbers into integers which you can use as group labels. The row numbers of the group labels are the indexes of your values.
> > > 
> > > Example: assume your input vector is dBin
> > > 
> > > nGroups <- 5  # number of groups
> > > groups <- (dBin - min(dBin)) / (max(dBin) - min(dBin)) # rescale to the range [0,1]
> > > > groups <- floor(groups * nGroups) + 1  # discretize to integers; note that max(dBin) lands alone in group nGroups + 1
> > > 
> > > > Now you can, e.g., get the indices for group 2
> > > > 
> > > > which(groups == 2)
> > > 
> > > Depending on the nature of your input data, it may be better to keep these groups in a column adjacent to your values, rather than in a separate vector, or even better to just calculate the groups on the fly in your downstream analysis with the approach given above in a function, rather than storing them at all. These are simple operations that should not add perceptibly to execution time.
> > > 
> > > Cheers,
> > > Boris
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > On Nov 4, 2015, at 6:40 AM, Alaios via R-help <r-help at r-project.org> wrote:
> > > 
> > > > > Thanks for the answer. Split does not give me the indexes though, only the group each value falls in. I also need the index of the group: is it the first, the second, ... group?
> > > > > Alex
> > > > 
> > > > 
> > > > 
> > > >    On Tuesday, November 3, 2015 5:05 PM, Ista Zahn <istazahn at gmail.com> wrote:
> > > > 
> > > > 
> > > > Probably
> > > > 
> > > > split(binDistance, test).
> > > > 
> > > > Best,
> > > > Ista
> > > > 
> > > > On Tue, Nov 3, 2015 at 10:47 AM, Alaios via R-help <r-help at r-project.org> wrote:
> > > > >> Dear all, I am not exactly sure what the proper name is for what I am trying to do.
> > > >> I have a vector that looks like
> > > >>  binDistance
> > > >>            [,1]
> > > >>  [1,] 238.95162
> > > >>  [2,] 143.08590
> > > >>  [3,]  88.50923
> > > >>  [4,] 177.67884
> > > >>  [5,] 277.54116
> > > >>  [6,] 342.94689
> > > >>  [7,] 241.60905
> > > >>  [8,] 177.81969
> > > >>  [9,] 211.25559
> > > >> [10,] 279.72702
> > > >> [11,] 381.95738
> > > >> [12,] 483.76363
> > > >> [13,] 480.98841
> > > >> [14,] 369.75241
> > > >> [15,] 267.73650
> > > >> [16,] 138.55959
> > > >> [17,] 137.93181
> > > >> [18,] 184.75200
> > > >> [19,] 254.64359
> > > >> [20,] 328.87785
> > > >> [21,] 273.15577
> > > >> [22,] 252.52830
> > > >> [23,] 252.52830
> > > >> [24,] 252.52830
> > > >> [25,] 262.20084
> > > >> [26,] 314.93064
> > > >> [27,] 366.02996
> > > >> [28,] 442.77467
> > > >> [29,] 521.20323
> > > >> [30,] 465.33071
> > > >> [31,] 366.60582
> > > >> [32,]  13.69540
> > > > >> so the numbers start from 13 and go up to a maximum of 522 (I also have many other similar sets). I want to put these numbers into 5 categories, and thus I have tried cut
> > > >> 
> > > >> 
> > > >> Browse[2]> test<-cut(binDistance,seq(min(binDistance)-0.00001,max(binDistance),length.out=scaleLength+1))
> > > >> Browse[2]> test
> > > >>  [1] (217,318]  (115,217]  (13.7,115] (115,217]  (217,318]  (318,420]
> > > >>  [7] (217,318]  (115,217]  (115,217]  (217,318]  (318,420]  (420,521]
> > > >> [13] (420,521]  (318,420]  (217,318]  (115,217]  (115,217]  (115,217]
> > > >> [19] (217,318]  (318,420]  (217,318]  (217,318]  (217,318]  (217,318]
> > > >> [25] (217,318]  (217,318]  (318,420]  (420,521]  (420,521]  (420,521]
> > > >> [31] (318,420]  (13.7,115]
> > > >> Levels: (13.7,115] (115,217] (217,318] (318,420] (420,521]
> > > >> 
> > > >> 
> > > > >> I then want the numbers of my initial vector that fall within the same "category", let's say (318,420], to be collected in a vector. To rephrase: the indexes of my initial vector that have a value between 318 and 420 should be put into the same vector, which I can then process as I want.
> > > > >> How can I do that effectively in R?
> > > > >> I would like to thank you for your reply.
> > > > >> Regards
> > > > >> Alex
> > > >> 
> > > >>        [[alternative HTML version deleted]]
> > > >> 
> > > >> ______________________________________________
> > > >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > >> https://stat.ethz.ch/mailman/listinfo/r-help
> > > >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > >> and provide commented, minimal, self-contained, reproducible code.
> > > > 
> > > > 
> > > >    [[alternative HTML version deleted]]
> > > > 
> > > > ______________________________________________
> > > > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > and provide commented, minimal, self-contained, reproducible code.
> > > 
> > > 
> > 
> > 
> 
>