[R] Generating a vector for breaks in a histogram

Sun Jul 6 05:39:33 CEST 2003

Well ... In my own recent example, it was plotting the raw data as a
histogram that finally directed me to the "truth" of what the data had to
say. As you may recall, the dataset was inter-arrival times of calls to a
computer routine, known only from timestamps truncated (not rounded) to
whole seconds. I started with kernel density estimation (sm.density, with the
default parameters, to be precise) and was unsatisfied with the result. Yesterday,
when I plotted the raw counts (how many values were 0, how many 1, etc.) as
a histogram, I was struck by two things:

1. There really are only two peaks -- the "stuff" in between them is, for
the purpose of business decisions, irrelevant.

2. The inter-arrival time value "0" in such a dataset represents all the
values that are greater than or equal to zero and *less than 1*, and so on.
There is a natural "histogramming" going on via the timestamp truncation,
which implies to me that the *midpoint* of the "bin" -- say, for the 0
values, 0.5 -- is the "natural" value to choose for the "x-axis" in the
absence of any better information. This also rather neatly disposes of the
issue of zero-valued inter-arrival times. :)
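
A minimal sketch of that binning in R, using simulated data as a stand-in
for the real inter-arrival times (the exponential distribution, the rate
and the sample size are assumptions made purely for illustration). It
follows the reading above that a recorded value k stands for a true value
in [k, k+1), builds integer break points, and plots the counts at the bin
midpoints:

```r
# Simulated stand-in for the real data (an assumption): exponential
# inter-arrival times, truncated to whole seconds with floor().
set.seed(1)
x <- floor(rexp(500, rate = 0.4))

# Integer break points 0, 1, ..., max(x) + 1: one bin per recorded value.
brk <- seq(0, max(x) + 1, by = 1)

# right = FALSE makes each bin [k, k + 1), which matches truncation.
h <- hist(x, breaks = brk, right = FALSE, plot = FALSE)

# h$mids holds the bin midpoints (0.5, 1.5, ...), h$counts the raw counts.
plot(h$mids, h$counts, type = "h",
     xlab = "inter-arrival time (s)", ylab = "count")
```

With right = FALSE the count in the first bin is the number of recorded
zeros, and it is drawn at x = 0.5 rather than at 0.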

Are the "old ways" best? Maybe not. Can I make reasonable business decisions
without histograms? I'm not convinced that's the case; it certainly wasn't
the case this time.

Finally, while I've never been fortunate enough to use S, the existence of R
has caused a revolution in the way I do the analysis of computer performance
data. Before R came along, the only tools I had available were Excel,
Minitab, and any special-purpose code I was willing to write to accomplish
tasks not in the vocabulary of Excel or Minitab. For example, it's
difficult, though not impossible, to do a non-linear regression or kernel
density estimation with either tool. In R, they're one-liners. If there were
a Nobel Prize for scientific software, I'd nominate R and its creators. (Of
course, there *is* a Nobel in Economics.) :)
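
To illustrate the "one-liner" claim, here is a hedged sketch on simulated
data. The exponential-decay model, the noise level and the starting values
are all assumptions chosen for the example; nls() and density() are the
standard functions from R's stats package:

```r
# Simulated data (an assumption): exponential decay plus Gaussian noise.
set.seed(2)
t <- 1:50
y <- 10 * exp(-0.1 * t) + rnorm(50, sd = 0.2)

# Non-linear least-squares fit in one call; start values are rough guesses.
fit <- nls(y ~ a * exp(-b * t), start = list(a = 8, b = 0.2))

# Kernel density estimate in one call, with the default bandwidth.
d <- density(y)
```

coef(fit) returns the estimated a and b, and plot(d) draws the density
estimate.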
-- 
M. Edward (Ed) Borasky
mailto:znmeb at borasky-research.net
http://www.borasky-research.net

> -----Original Message-----
> Things have moved on since the ASH work too, but I would agree that
> density estimation is often a better way than histograms.  However,
> close to state-of-the-art density estimation is built into R
> (?density) and packages `polspline', `KernSmooth' and `sm' are also
> much more advanced than `ash'.
>
> It was the advent of enough computing power that changed this, and
> the S language has been in the forefront of making the state of the
> art available.  You'll see that MASS (the book) covers histograms and
> alternatives in its chapter on Univariate Distributions, and it has
> since its 1994 first edition (when did you go to `school'?)