# [R] Generating a vector for breaks in a histogram

M. Edward Borasky znmeb at aracnet.com
Sun Jul 6 05:39:33 CEST 2003

Well ... In my own recent example, it was plotting the raw data as a
histogram that finally directed me to the "truth" of what the data had to
say. As you may recall, the dataset was inter-arrival times of calls to a
computer routine, known only from timestamps truncated (not rounded) to
whole seconds. I started with kernel density (sm.density, with the default
parameters, to be precise) and was unsatisfied with the result. Yesterday,
when I plotted the raw counts (how many values were 0, how many 1, etc.) as
a histogram, I was struck by two things:

1. There really are only two peaks -- the "stuff" in between them is, for
the purpose of business decisions, irrelevant.

2. The inter-arrival time value "0" in such a dataset represents all the
values that are greater than or equal to zero and *less than 1*, and so on.
There is a natural "histogramming" going on via the timestamp truncation,
which implies to me that the *midpoint* of the "bin" -- say, for the 0
values, 0.5 -- is the "natural" value to choose for the "x-axis" in the
absence of any better information. This also rather neatly disposes of the
issue of zero-valued inter-arrival times. :)
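
A minimal sketch of the binning described in point 2, assuming the
inter-arrival times are whole seconds held in a numeric vector. The data
values and the names `gaps`, `breaks` and `h` are made up for illustration;
they are not from the original dataset. The breaks vector puts one
unit-wide bin at each integer, and `right = FALSE` closes each bin on the
left, so a recorded 0 lands in [0, 1) with midpoint 0.5.

```r
# Hypothetical inter-arrival times, truncated to whole seconds.
gaps <- c(0, 0, 1, 0, 3, 1, 0, 7, 7, 8, 7, 0, 1, 8)

# One unit-wide bin per integer value: [0,1), [1,2), ..., up to max + 1.
breaks <- seq(0, max(gaps) + 1, by = 1)

# right = FALSE gives left-closed bins, matching truncation:
# a recorded value k stands for a true value in [k, k + 1).
h <- hist(gaps, breaks = breaks, right = FALSE, plot = FALSE)

h$mids    # bin midpoints: k + 0.5 for each integer k
h$counts  # raw counts per integer value, zeros included
```

Calling `plot(h)` afterwards draws the histogram with those bins.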

Are the "old ways" best? Maybe not. Can I make reasonable business decisions
without histograms? I'm not convinced that's the case; it certainly wasn't
the case this time.

Finally, while I've never been fortunate enough to use S, the existence of R
has caused a revolution in the way I do the analysis of computer performance
data. Before R came along, the only tools I had available were Excel,
Minitab, and any special-purpose code I was willing to write to accomplish
tasks not in the vocabulary of Excel or Minitab. For example, it's
difficult, though not impossible, to do a non-linear regression or kernel
density estimation with either tool. In R, they're one-liners. If there were
a Nobel Prize for scientific software, I'd nominate R and its creators. (Of
course, there *is* a Nobel in Economics.) :)
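
To illustrate the "one-liners" remark, here is a rough sketch using
`density()` and `nls()` from base R. The data are simulated purely for
illustration, and the exponential-decay model, the starting values and all
variable names are assumptions, not anything from the analysis described
above.

```r
set.seed(1)

# Simulated (hypothetical) sample; kernel density estimate, default bandwidth.
x <- rexp(200, rate = 0.5)
d <- density(x)

# Simulated (hypothetical) decay curve with a little noise.
elapsed <- 1:20
y <- 5 * exp(-0.3 * elapsed) + rnorm(20, sd = 0.05)

# Non-linear least squares fit of y = a * exp(b * elapsed);
# the starting values are rough guesses chosen for this example.
fit <- nls(y ~ a * exp(b * elapsed), start = list(a = 4, b = -0.2))
coef(fit)
```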
--

> -----Original Message-----
> Things have moved on since the ASH work too, but I would agree that
> density estimation is often a better way than histograms.  However,
> close to state-of-the-art density estimation is built into R
> (?density) and packages `polspline', `KernSmooth' and `sm' are also
> much more advanced than `ash'.
>
> It was the advent of enough computing power that changed this, and
> the S language has been in the forefront of making the state of the
> art available.  You'll see that MASS (the book) covers histograms and
> alternatives in its chapter on Univariate Distributions, and it has
> since its 1994 first edition (when did you go to `school'?)