[R] Binning numbers into integer-valued intervals (or: a version of cut or cut2 that makes sense)

Karl Ove Hufthammer karl at huftis.org
Tue Jul 26 16:40:40 CEST 2011


William Dunlap wrote:

>> $ cut(c(20.8, 21.3, 21.7, 23, 25), 2, dig.lab=1)
>> [1] (21,23] (21,23] (21,23] (23,25] (23,25]
>> Levels: (21,23] (23,25]
>> 
>> So the first number, 20.8, gets put in the interval (21,23], which seems
>> strange. I can see why this could happen, though, as perhaps the 20.8 is
>> rounded to 21 before binning. But it’s even stranger that the *integer*
>> 23 is put in the interval (23,25] instead of in the interval (21,23].
>> Can anyone explain why?
> 
> dig.lab does not affect the choice of break points, it only
> affects how they are converted to character form for the labels.
> Unfortunately, cut() does not return the actual breakpoints but
> if you make them yourself you know what they are.
> 
> You need to find or make a function akin to pretty() that returns
> a "nice" set of breakpoints.  pretty() itself may do:

Thanks. pretty() is nice, but I miss the possibility of creating intervals
with approximately equal numbers of observations. cut2() in Hmisc can do
this, but then I still have the problem of the interval labels not
corresponding to the actual interval endpoints, which will likely be very
confusing for consumers of the output data.
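
To make the quoted explanation concrete, the sketch below rebuilds the break
points that cut() uses when it is given a number of intervals. It follows the
description on the ?cut help page (the range is split into equal-width pieces
and the outer limits are moved outwards by 0.1% of the range). The variable
names are made up for the illustration, and the reconstruction is an
assumption based on that description, not a copy of cut()'s source.

```r
x = c(20.8, 21.3, 21.7, 23, 25)

# Rebuild the break points for two equal-width intervals
# (assumption: this mirrors the behaviour described in ?cut)
rx = range(x)
b = seq(rx[1], rx[2], length.out = 3)                # 20.8, 22.9, 25.0
b[c(1, 3)] = b[c(1, 3)] + c(-1, 1) * diff(rx)/1000   # widen the outer limits slightly

f.auto = cut(x, 2, dig.lab = 1)     # break points chosen by cut()
f.manual = cut(x, b, dig.lab = 1)   # break points supplied explicitly

# Same binning either way: 23 lies above the middle break (about 22.9),
# so it belongs to the upper interval even though the label shows 23
as.integer(f.auto)
as.integer(f.manual)
```

Supplying the break points explicitly, as in the second call, is what makes it
possible to know (and control) the real interval endpoints.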

I ended up creating a new function, cut3(), to handle this. It tries to
create intervals with approximately equal numbers of observations, whose
endpoints have a specified number of decimal digits (*not* significant
digits), and which have the property that if a number x is put in the
interval (a,b], then x > a & x ≤ b (and similarly for the other half-open
interval and for closed intervals).
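
The membership property described above can be checked mechanically. The
sketch below uses a hypothetical helper, label_matches() (not part of cut3()
or of base R), that parses the endpoints out of labels such as "(21,23]" and
tests each value against them. It assumes plain decimal labels and no NA
values.

```r
# Hypothetical helper: TRUE where x[i] really lies in the interval named by
# the label of f[i], honouring open "(" / ")" and closed "[" / "]" ends
label_matches = function(x, f) {
	lab = as.character(f)
	left.closed = substr(lab, 1, 1) == "["
	right.closed = substr(lab, nchar(lab), nchar(lab)) == "]"
	ends = strsplit(substr(lab, 2, nchar(lab) - 1), ",", fixed = TRUE)
	a = as.numeric(vapply(ends, `[`, character(1), 1))   # left endpoints
	b = as.numeric(vapply(ends, `[`, character(1), 2))   # right endpoints
	ifelse(left.closed, x >= a, x > a) & ifelse(right.closed, x <= b, x < b)
}

x = c(20.8, 21.3, 21.7, 23, 25)

# Labels rounded by dig.lab: 20.8 and 23 do not match their labels
ok.default = label_matches(x, cut(x, 2, dig.lab = 1))

# Explicit integer break points: every value matches its label
ok.exact = label_matches(x, cut(x, c(20, 23, 25), dig.lab = 12))
```

Here ok.default is FALSE for 20.8 and 23, the two values questioned in the
original post, while ok.exact is TRUE throughout.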

cut3() basically has the same interface as cut() (and uses cut() internally).
I have done some basic testing, and it seems to work fine, even with short
vectors and a small number of intervals, but do expect bugs (and it really
should have better input validation).

Here’s the code, with some examples at the end:

####################################################################################

cut3 = function(x, ints = 5, rdig = 0, include.lowest = FALSE, right = TRUE, ...) {
	# Simple check to ensure the number of intervals is valid
	# (could add a lot of other error checking tests here!) 
	if (!is.numeric(ints) || is.na(ints) || length(ints) != 1 || ints < 2) 
		stop("invalid number of intervals")
	
	# Calculate quantiles (with fractional parts)
	breaks = quantile(x, probs = seq(0, 1, 1/(ints))) * 10^rdig
	
	# Round the quantiles to integers
	n = length(breaks)
	breaks[1] = floor(breaks[1])
	breaks[n] = ceiling(breaks[n])
	breaks[2:(n - 1)] = round(breaks[2:(n - 1)])
	breaks = unique(breaks)/10^rdig
	
	# 'cut' needs at least two intervals (three values of 'breaks'),
	# so add a few if the above calculation results in fewer than this
	# ('max' is needed when length(breaks) == 1, since diff(breaks) is then empty)
	while (length(breaks) <= 2) 
		breaks[length(breaks) + 1] = breaks[length(breaks)] + max(diff(breaks), 1)
	
	# Warn if we couldn't generate the requested number of intervals
	# (may happen when x is very short or consists of identical values)
	if (length(breaks) - 1 != ints) 
		warning(paste("Only ", length(breaks) - 1, " intervals generated, not ", ints, " as requested", sep = ""))
	
	# Expand the leftmost/rightmost interval to include the min/max values of 'x'
	if (!include.lowest) {
		if (right && min(x) == breaks[1]) {
			breaks[1] = breaks[1] - 1/10^rdig
		}
		else if (!right && max(x) == breaks[length(breaks)]) {
			breaks[length(breaks)] = breaks[length(breaks)] + 1/10^rdig
		}
	}
	
	# Use 'cut' to bin the data
	cut(x, breaks, dig.lab = 12, include.lowest = include.lowest, right = right, ...)
} 

# Some examples
set.seed(1)
x = rexp(50, 1/10); range(x)   # some data
cut3(x)                        # 5 intervals (default)
cut3(x, ints=2)                # 2 intervals
cut3(x, rdig=1)                # 1 decimal digit
cut3(x, ints=100)              # fewer than 100 intervals generated
cut3(x, ints=100, rdig=2)      # 100 intervals generated
cut3(1:2, ints=2)              # lowest interval expanded to 0
cut3(1:2, include.lowest=TRUE) # lowest interval not expanded
cut3(5, ints=3)                # also works with very short vectors

####################################################################################
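
As a worked illustration of the rounding step in cut3(), the sketch below
applies the same floor/round/ceiling logic by hand to the five numbers from
the original question, with two intervals and rdig = 0. It is a simplified
walk-through under those assumptions (no handling of duplicate break points
and no expansion of the outer intervals), not a replacement for the function.

```r
x = c(20.8, 21.3, 21.7, 23, 25)
rdig = 0

# Quantiles for two intervals, scaled so that rounding acts on 'rdig' decimals
q = quantile(x, probs = c(0, 0.5, 1), names = FALSE) * 10^rdig   # 20.8, 21.7, 25

# Round outwards at the ends and to nearest in the middle, then scale back
b = c(floor(q[1]), round(q[2]), ceiling(q[3]))/10^rdig           # 20, 22, 25

# The labels now show the break points that are actually used
f = cut(x, b, dig.lab = 12)
levels(f)       # "(20,22]" "(22,25]"
as.integer(f)   # 1 1 1 2 2
```

With these break points, 20.8 is labelled (20,22] and 23 is labelled (22,25],
so each label agrees with the value it contains.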

-- 
Karl Ove Hufthammer


