[R] How to superimpose a histogram and density plot
Martin Maechler
maechler at stat.math.ethz.ch
Tue Jun 8 10:24:49 CEST 1999
>>>>> "PD" == Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk> writes:
PD> "Venables, Bill (CMIS, Cleveland)" <Bill.Venables at cmis.CSIRO.AU>
PD> writes:
>> The fact that every elementary book on statistics does it this way
>> does not make it correct. To be helpful, a histogram really has to
>> be a non-parametric density estimator, period.
>>
>> Enough already of polemics.
PD> Not quite! There is a reason for doing it the other way, namely
PD> that the concept of a histogram generally comes before the concept
PD> of a probability density, pedagogically. It is very easy to explain
PD> that you chop up the axis into bins and count the number of data
PD> points that fall in each of them. I bet that half of the MDs that I
PD> teach never quite understand the density (hell, the author of the
PD> textbook I use managed to plot three identical gaussian curves with
PD> identical y axis but different x axes... and he's a
PD> statistician). So for the basic uses of the histogram, one would be
PD> replacing a perfectly intuitive simple unit with a substantially
PD> more complex one.
I agree 100% with Peter.
Being a mathematician I agree with Bill that for us, a histogram is a
(very suboptimal) density estimate; but average statistics software users
*do* learn histograms differently.
-- quite a few ``learn'' histograms even before high school...
>> If you want a density estimate and a histogram
>> on the same scale, I suggest you try something like this:
>>
>> > IQR <- diff(summary(data)[c(2,5)])
With R, the above line is superfluous:
1) IQR(.) is already an R function!
2) density(.) in R *has* a reasonable default bandwidth (contrary to S),
namely Silverman's rule of thumb.
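Both points can be checked with a short sketch on simulated data. The vector `x` and the seed are made up for illustration, and the sketch assumes a current R in which density() takes its default bandwidth from bw.nrd0(), documented as Silverman's rule of thumb:

```r
set.seed(1)                      # hypothetical data, only for illustration
x <- rnorm(200)

# 1) IQR() gives the same number as third quartile minus first quartile
iqr_by_hand <- unname(diff(quantile(x, c(0.25, 0.75))))
iqr_builtin <- IQR(x)

# 2) the default bandwidth of density() is bw.nrd0(), i.e.
#    0.9 * min(sd, IQR/1.34) * n^(-1/5)
bw_by_hand <- 0.9 * min(sd(x), IQR(x) / 1.34) * length(x)^(-1/5)
bw_default <- density(x)$bw

print(all.equal(iqr_by_hand, iqr_builtin))
print(all.equal(bw_by_hand, bw_default))
```

So neither the hand-made IQR nor an explicit width is needed unless one wants a different amount of smoothing.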
>> > dest <- density(data, width = 2*IQR) # or some smaller width, maybe,
>> > hist(data, xlim = range(dest$x), xlab = "x", ylab = "density",
>> + probability = TRUE) # <<<--- this is the vital argument
>> > lines(dest, lty=2)
PD> Yep. freq=FALSE has the same effect and might be more logical,
PD> since the y-axis is not really probability but "probability per x
PD> unit".
which in sum leads to
dest <- density(data)
hist(data, xlim = range(dest$x), xlab = "x", ylab = "density", freq = FALSE)
lines(dest, lty=2)
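
To try this without a data set at hand, here is a self-contained sketch. The simulated `data` vector, the seed and the sample size are assumptions for illustration only; the plotting calls are the three lines above. The last line checks the point about units: with freq = FALSE the bar heights are "probability per x unit", so height times bin width should sum to one.

```r
set.seed(42)                          # made-up data, not from the thread
data <- rgamma(500, shape = 3)

dest <- density(data)                 # default bandwidth (bw.nrd0)
h <- hist(data, xlim = range(dest$x), xlab = "x", ylab = "density",
          freq = FALSE)               # density scale, not counts
lines(dest, lty = 2)                  # dashed kernel estimate on top

# bar height times bin width, summed over all bins
bar_area <- sum(h$density * diff(h$breaks))
print(bar_area)
```

If the density curve is clipped at the top, passing a suitable ylim to hist() (for example based on max(dest$y) and max(h$density)) is one way to make room for it.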