[R] Displaying a distribution -- was: Combining two histograms

Wed Feb 2 17:51:51 CET 2005

May I take this off topic a little to seek collective wisdom (and so feel
free to reply privately).

The catalyst is Deepayan's remark:

> Histograms were appropriate for drawing density estimates by 
> hand in the  good old days, but I can imagine very few situations where I 
> would not prefer to use smoother density estimates when I have the 
> computational power to do so.
> 
> Deepayan

Generally, I agree; but the appearance and thus one's perception and
interpretation of both histograms and density plots depend upon the
parameters chosen for the display (bin boundaries for histograms; bandwidth
and kernel for density plots). Important data peculiarities like arbitrary
rounding, favoring of certain values, resolution limitations, and so forth
are therefore often lost. I would instead advocate that simple quantile
plots -- plot(ppoints(x), sort(x)) -- or perhaps normal qqplots always be the
first plot used to explore univariate data distributions. I believe this
conforms to the recommendations of Bill Cleveland, who says in the first
sentence on p. 17 of VISUALIZING DATA, on visualizing univariate data:
"Quantiles are essential to visualizing distributions."
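
To make the comparison concrete, here is a minimal sketch. The data are
simulated and purely hypothetical: normal values recorded to the nearest 0.5,
standing in for the kind of rounding mentioned above. The seed, sample size,
and rounding step are arbitrary assumptions. hist() and density() are called
with their defaults, so other bin or bandwidth choices would look different.

```r
set.seed(1)

# Hypothetical measurements, rounded to the nearest 0.5
x <- round(rnorm(200, mean = 10, sd = 2) * 2) / 2

op <- par(mfrow = c(1, 3))

# Default bins: the rounding is not apparent
hist(x, main = "Histogram")

# Default bandwidth and kernel: the rounding is smoothed away
plot(density(x), main = "Density estimate")

# Quantile plot: every observation is shown, so ties appear as flat steps
plot(ppoints(x), sort(x), xlab = "f-value", ylab = "Sorted x",
     main = "Quantile plot")

par(op)
```

In the third panel the rounded values should show up as horizontal runs of
points, which neither of the first two panels reveals at default settings.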

While it is true that many people may be unfamiliar with quantile plots, I
think we need to improve modern statistical practice not only by abandoning
histograms in favor of density plots, but also by always first using
quantile plots and explaining why this is necessary.

Difficult issue: What should one do when there are, say, a million
values?
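
One possibility, offered only as a sketch and not as a settled answer, is to
plot a fixed grid of sample quantiles rather than every sorted value. The
simulated data and the choice of 1000 probabilities below are arbitrary
assumptions.

```r
set.seed(2)

# Hypothetical large sample
x <- rexp(1e6)

# 1000 evenly spaced probabilities; the number is an arbitrary choice
p <- ppoints(1000)

# Sample quantiles at those probabilities
q <- quantile(x, probs = p, names = FALSE)

plot(p, q, xlab = "f-value", ylab = "Quantile of x")
```

The obvious cost is that thinning discards detail, particularly in the
extreme tails, so this does not fully resolve the difficulty.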

Alternative views?

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box