[R] Binning numbers into integer-valued intervals (or: a version of cut or cut2 that makes sense)

Karl Ove Hufthammer karl at huftis.org
Mon Jul 25 16:00:46 CEST 2011


Dear list members,

I’m looking for a way to divide numbers into simple (i.e., integer-valued) 
intervals, and thought the ‘cut’ function in ‘base’ or the ‘cut2’ function 
in ‘Hmisc’ would, er, cut it. However, they seem to give rather surprising 
results.

Since I want the endpoints of the intervals to be integers, I used the 
‘dig.lab’ and ‘digits’ arguments. One assumption I made: If the number x 
gets the label (a, b], then x lies in the interval (a, b]. It turns out that 
this assumption was incorrect. Example:

$ cut(c(20.8, 21.3, 21.7, 23, 25), 2, dig.lab=1)
[1] (21,23] (21,23] (21,23] (23,25] (23,25]
Levels: (21,23] (23,25]

So the first number, 20.8, gets put in the interval (21,23], which seems 
strange. I can see why this could happen, though, as perhaps the 20.8 is 
rounded to 21 before binning. But it’s even stranger that the *integer* 23 
is put in the interval (23,25] instead of in the interval (21,23]. Can 
anyone explain why? 
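
One way to investigate is to compare the labels with the break points 
themselves. The following is only a minimal sketch. It assumes that, when 
‘breaks’ is a single number, ‘cut’ splits the range of the data into 
equal-width pieces and moves the outer limits out by 0.1% of the range, as 
described in ?cut, and that ‘dig.lab’ only affects how the labels are 
printed. The names ‘brk’, ‘codes_auto’ and ‘codes_manual’ are made up for 
this illustration.

```r
x <- c(20.8, 21.3, 21.7, 23, 25)

# Same call as above, but with more significant digits in the labels.
levels(cut(x, 2, dig.lab = 4))

# Reconstruct the break points by hand (assumption: equal-width pieces over
# range(x), with the outer limits moved out by 0.1% of the range).
rx <- range(x)
dx <- diff(rx)
brk <- seq(rx[1], rx[2], length.out = 3)
brk[c(1, 3)] <- c(rx[1] - dx / 1000, rx[2] + dx / 1000)
brk

# Bin codes from the automatic breaks and from the reconstructed ones.
codes_auto   <- as.integer(cut(x, 2))
codes_manual <- as.integer(cut(x, brk))
identical(codes_auto, codes_manual)
```

If that assumption holds, the middle break point is 22.9 rather than 23, 
which would account for 23 ending up in the upper interval.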

I then turned to ‘cut2’ in ‘Hmisc’. But again I was surprised by the result:

$ cut2(c(20.8, 21.3, 21.7, 23), g=2, digits=1)
[1] [21,22) [21,22) [22,23] [22,23]
Levels: [21,22) [22,23]

Again 20.8 is placed in an interval that doesn’t mathematically contain it. 
And 21.3 and 21.7 are placed in *different* intervals, instead of both being 
placed in the interval [21,22). Strictly speaking, this may not be a bug, 
but it’s certainly surprising behaviour!

Since neither of the two functions does what I require, is there a 
different function that does, hidden deep inside some R package? 
This function should take as input a vector of numbers, and output a vector 
of non-overlapping (but ‘touching’) intervals with integer end-points so 
that each number is in exactly one interval. It should of course also 
include information on which interval each number belongs to.
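
To make the requirement concrete, here is a rough sketch of the behaviour I 
have in mind. It uses ‘cut’ with explicit integer break points. The helper 
‘int_cut’ and its ‘width’ argument are hypothetical names, not taken from 
any package, and ‘width’ is assumed to be a positive integer.

```r
# Hypothetical helper (not from any package): bins x into touching,
# non-overlapping intervals whose end points are integers.
# Assumption: width is a positive integer.
int_cut <- function(x, width = 1, right = TRUE) {
  lo <- floor(min(x, na.rm = TRUE))
  hi <- ceiling(max(x, na.rm = TRUE))
  # Extend the upper end so that steps of 'width' cover the whole range.
  hi <- lo + width * ceiling((hi - lo) / width)
  if (hi == lo) hi <- lo + width   # all values equal to a single integer
  breaks <- seq(lo, hi, by = width)
  # Explicit integer breaks mean the labels show the true end points;
  # include.lowest = TRUE closes the outermost interval on both sides.
  cut(x, breaks = breaks, right = right, include.lowest = TRUE,
      dig.lab = 10)
}

x <- c(20.8, 21.3, 21.7, 23, 25)
bins <- int_cut(x, width = 2, right = FALSE)
levels(bins)       # "[20,22)" "[22,24)" "[24,26]"
as.integer(bins)   # 1 1 1 2 3
```

Here every number lies in the interval it is labelled with, and the factor 
itself records which interval each number belongs to.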

Version information (though I also observe this on R 2.13.1 on Windows):

$ sessionInfo()
R version 2.13.1 Patched (2011-07-25 r56494)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=nn_NO.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=nn_NO.UTF-8        LC_COLLATE=nn_NO.UTF-8    
 [5] LC_MONETARY=C              LC_MESSAGES=nn_NO.UTF-8   
 [7] LC_PAPER=nn_NO.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=nn_NO.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] Hmisc_3.8-3     survival_2.36-9

loaded via a namespace (and not attached):
[1] cluster_1.14.0  grid_2.13.1     lattice_0.19-30

-- 
Karl Ove Hufthammer


