[R] density of hist(freq = FALSE) inversely affected by data magnitude
wdunlap at tibco.com
Wed Jan 23 00:33:24 CET 2013
The probability density function is not unitless - it is the derivative of the
[cumulative] probability distribution function, so it has units of delta-probability-mass
over delta-x. It must integrate to 1 (over all possible x). hist(freq=FALSE,x)
or hist(prob=TRUE,x) displays an estimate of the density function, and the following
example shows how its scale matches what you get from the presumed
population density function.
f <- function (n, sd) {
    x <- rnorm(n, sd = sd)
    hist(x, freq = FALSE) # estimated density
    s <- seq(min(x), max(x), length.out = 129)
    lines(s, dnorm(s, sd = sd), col = "red") # overlay expected density for this sample
}
> f(1e6, sd=1)
> f(100, sd=1)
> f(100, sd=0.0001)
> f(1e6, sd=0.0001)
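
The integrate-to-1 property can also be checked numerically without plotting. This is a minimal sketch: the helper name hist_area is hypothetical (not part of R), and it simply sums bar height times bar width from the object that hist() returns.

```r
# Hypothetical helper: total area of the bars of a density-scaled histogram.
hist_area <- function(x) {
  h <- hist(x, plot = FALSE)        # same binning as hist(x, freq = FALSE)
  sum(h$density * diff(h$breaks))   # bar height times bar width, summed
}

set.seed(1)
hist_area(rnorm(100, sd = 1))       # area is 1, up to rounding
hist_area(rnorm(100, sd = 0.0001))  # still 1, although the bars are far taller
```

The bars get taller as sd shrinks only because they get narrower by the same factor, so the total area does not change.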
Spotfire, TIBCO Software
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of J Toll
> Sent: Tuesday, January 22, 2013 2:48 PM
> To: r-help
> Subject: [R] density of hist(freq = FALSE) inversely affected by data magnitude
> I have a couple of observations, a question or two, and perhaps a
> suggestion related to the plotting of density on the y-axis within the
> hist() function when freq=FALSE. I was using the function and trying
> to develop an intuitive understanding of what the density is telling
> me. After reading through this fairly helpful post:
> I finally realized that in the case where freq = FALSE, the y-axis
> isn't really telling me the density. It's actually indicating the
> density multiplied by the bin size. I assume this is for the case
> where the bins may be of non-regular size.
> from hist.default:
> dens <- counts/(n * diff(breaks))
> So the count in each bin is divided by the total number of
> observations (n) multiplied by the size of the bin. The problem, as I
> see it, is that the density ends up being scaled by the size of the
> bins, which is inversely proportional to the magnitude of the data.
> Therefore the magnitude of the data is directly affecting the density,
> which seems problematic.
> For example*:
> x <- runif(100)
> y <- x / 1000
> par(mfrow = c(2, 1))
> hist(x, prob = TRUE)
> hist(y, prob = TRUE)
> From this example, you see that the density for the y histogram is
> 1000 times larger, simply because the y data is 1000 times smaller.
> Again, that seems problematic. It seems to me, that the density
> should be unit-less, but here it's affected by the magnitude of the
> data.
> So, my question is, why is density calculated this way?
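
One way to see where the factor of 1000 comes from, sketched under the assumption that both histograms use matching breaks (passed explicitly here, which the quoted example does not do): the bar heights grow by 1000 because the bar widths shrink by 1000, so each bar's area, the estimated probability of its bin, stays the same.

```r
set.seed(42)
x <- runif(100)
y <- x / 1000

b  <- seq(0, 1, by = 0.1)                       # assumed breaks for x
hx <- hist(x, breaks = b, plot = FALSE)
hy <- hist(y, breaks = b / 1000, plot = FALSE)  # same bins in the rescaled units

# Heights differ by the change of units ...
all.equal(hy$density, 1000 * hx$density)
# ... but the bar areas (bin probabilities) agree.
all.equal(hy$density * diff(hy$breaks), hx$density * diff(hx$breaks))
```
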
> For the case where all the bins are of the same size, I would think
> density should simply be calculated as:
> dens <- counts / n
> Of course, that might be somewhat misleading for the case where the
> bin sizes vary. So then why not calculate density as:
> dens <- counts / (n * diff(breaks) / min(diff(breaks)))
> Dividing diff(breaks) by min(diff(breaks)) removes the scaling effect
> of the magnitude of the data, and simply leaves the relative
> difference in bin size.
> For the case where all the bins are the same size, the calculation is
> equivalent to dens <- counts / n
> For all other cases, the density is scaled by the size of the bin, but
> unaffected by the magnitude of the data.
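
The proposed rescaling can be compared with what hist() reports, as a rough sketch. The unequal breaks and the variable names below are illustrative assumptions. The rescaled values equal hist()'s density multiplied by the narrowest bin width, so they no longer depend on the units of the data, but the bar areas then sum to min(diff(breaks)) rather than 1, and the heights can no longer be compared with a density function such as dunif() or dnorm().

```r
set.seed(7)
x <- runif(200)
b <- c(0, 0.1, 0.3, 0.6, 1)               # assumed, deliberately unequal breaks
h <- hist(x, breaks = b, plot = FALSE)
n <- length(x)
w <- diff(h$breaks)                       # bin widths

proposed <- h$counts / (n * w / min(w))   # formula suggested in the question
all.equal(proposed, h$density * min(w))   # density times the narrowest bin width

sum(h$density * w)                        # area under hist()'s density: 1
sum(proposed * w)                         # area under the proposal: min(w), here 0.1
```
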
> So, what am I misunderstanding? Why is density calculated as it is,
> and what does it mean?
> *example from http://stats.stackexchange.com/questions/17258/odd-problem-with-a-