[R] Displaying a distribution -- was: Combining two histograms
gunter.berton at gene.com
Wed Feb 2 17:51:51 CET 2005
May I take this off topic a little to seek collective wisdom (and so feel
free to reply privately).
The catalyst is Deepayan's remark:
> Histograms were appropriate for drawing density estimates by
> hand in the good old days, but I can imagine very few situations where I
> would not prefer to use smoother density estimates when I have the
> computational power to do so.
Generally, I agree; but the appearance and thus one's perception and
interpretation of both histograms and density plots depend upon the
parameters chosen for the display (bin boundaries for histograms; bandwidth
and kernel for density plots). Important data peculiarities like arbitrary
rounding, favoring of certain values, resolution limitations, and so forth
are therefore often lost. I would instead advocate that simple quantile
plots -- plot(ppoints(x),sort(x)) -- or perhaps normal qqplots always be the
first plot used to explore univariate data distributions. I believe this
conforms to the recommendations of Bill Cleveland, who says in the first
sentence on p. 17 of VISUALIZING DATA, on visualizing univariate data:
"Quantiles are essential to visualizing distributions."
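
To make the point concrete, here is a minimal sketch in base R. The data are
simulated and the rounding quirk is purely an assumption for illustration
(nothing here comes from the original thread): about half of the values are
rounded to the nearest 5. The histograms and the density plot may look quite
different depending on the bin count and bandwidth, while the quantile plot
shows every observation, so the rounded values should appear as flat steps.

```r
# Simulated data; the distribution and its parameters are arbitrary choices.
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)

# Hypothetical recording quirk: 100 of the 200 values rounded to the nearest 5.
idx <- sample(length(x), 100)
x[idx] <- 5 * round(x[idx] / 5)

op <- par(mfrow = c(2, 2))

# Same data, two bin choices ('breaks' is only a suggestion to hist()).
hist(x, breaks = 8,  main = "Histogram, about 8 bins")
hist(x, breaks = 40, main = "Histogram, about 40 bins")

# Kernel density estimate with the default bandwidth and Gaussian kernel.
plot(density(x), main = "Density, default bandwidth")

# Quantile plot: sorted data against plotting positions.
plot(ppoints(x), sort(x), main = "Quantile plot",
     xlab = "Fraction of data", ylab = "Sorted x")

par(op)
```

Replacing the last plot with qqnorm(x) gives the normal QQ plot variant; either
way no smoothing parameter has to be chosen before looking at the data.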
While it is true that many people may be unfamiliar with quantile plots, I
think we need to improve modern statistical practice not only by abandoning
histograms in favor of density plots, but also by always first using
quantile plots and explaining why this is necessary.
Difficult issue: What should one do when there are, say, a million
-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
"The business of the statistician is to catalyze the scientific learning
process." - George E. P. Box