# [R] How to superimpose a histogram and density plot

Martin Maechler maechler at stat.math.ethz.ch
Tue Jun 8 10:24:49 CEST 1999

```
>>>>> "PD" == Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk> writes:

PD> "Venables, Bill (CMIS, Cleveland)" <Bill.Venables at cmis.CSIRO.AU>
PD> writes:
>> The fact that every elementary book on statistics does it this way
>> does not make it correct.  To be helpful, a histogram really has to
>> be a non-parametric density estimator, period.
>>

PD> Not quite! There is a reason for doing it the other way, namely
PD> that the concept of a histogram generally comes before the concept
PD> of a probability density, pedagogically. It is very easy to explain
PD> that you chop up the axis into bins and count the number of data
PD> points that fall in each of them. I bet that half of the MDs that I
PD> teach never quite understand the density (hell, the author of the
PD> textbook I use managed to plot three identical gaussian curves with
PD> identical y axis but different x axes... and he's a
PD> statistician). So for the basic uses of the histogram, one would be
PD> replacing a perfectly intuitive simple unit with a substantially
PD> more complex one.

I agree 100% with Peter.
Being a mathematician I agree with Bill that for us, a histogram is a
(very suboptimal) density estimate;  but average statistics software users
*do* learn histograms differently:
quite a few ``learn'' histograms even before high school.

>> If you want a density estimate and a histogram
>> on the same scale, I suggest you try something like this:
>>
>> > IQR <- diff(summary(data)[c(5,2)])

with R, the above line is superfluous:

1) IQR(.) is already an R function!
2) density(.) in R *has* a reasonable default bandwidth (contrary to S),
namely Silverman's rule of thumb

>> > dest <- density(data, width = 2*IQR)  # or some smaller width, maybe,
>> > hist(data, xlim = range(dest$x), xlab = "x", ylab = "density",
>> +      probability = TRUE)    # <<<--- this is the vital argument
>> > lines(dest, lty=2)

PD> Yep. frequency=FALSE has the same effect and might be more logical,
PD> since the y-axis is not really probability but "probability per x
PD> unit".

dest <- density(data)
hist(data, xlim = range(dest$x), xlab = "x", ylab = "density", freq = FALSE)
lines(dest, lty=2)

```
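
For readers who want to run the approach discussed in the thread, here is a minimal self-contained sketch. The simulated sample, the seed, the plot title, and the names `iqr` and `h` are assumptions made for illustration and do not come from the original posts. Computing the histogram first with `plot = FALSE` is also a choice made here, so that `ylim` can be set wide enough for both the bars and the curve.

```r
# Simulated sample (an assumption for illustration; any numeric vector works).
set.seed(1)
data <- rnorm(200, mean = 10, sd = 2)

# IQR() is built in, so there is no need to derive it from summary().
iqr <- IQR(data)

# With no bandwidth given, density() uses bw.nrd0(),
# i.e. Silverman's rule of thumb.
dest <- density(data)

# Compute the histogram without drawing it, so the y-range can be
# chosen to fit both the bars and the density curve.
h <- hist(data, plot = FALSE)

# freq = FALSE draws densities instead of counts: bar height is
# "probability per x unit", the same scale as the density estimate.
plot(h, freq = FALSE,
     xlim = range(dest$x),
     ylim = c(0, max(h$density, dest$y)),
     xlab = "x", ylab = "density",
     main = "Histogram with density estimate")
lines(dest, lty = 2)

# Sanity check: bar areas (height * bin width) should sum to 1,
# up to floating-point rounding.
sum(h$density * diff(h$breaks))
```

If the curve looks too smooth or too rough, the `adjust` argument of `density()` scales the default bandwidth without having to pick an absolute value.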