[Rd] ?hist and $density explanation
François Pinard
pinard at iro.umontreal.ca
Thu May 18 03:20:34 CEST 2006
Hi, people. Within ?hist (using R 2.3.0), one reads:
density: values f^(x[i]), as estimated density values. If
'all(diff(breaks) == 1)', they are the relative frequencies
'counts/n' and in general satisfy sum[i; f^(x[i])
(b[i+1]-b[i])] = 1, where b[i] = 'breaks[i]'.
I trip on this explanation each time I read it. Some R guardians will
be tempted to say that since R itself does not trip, I am necessarily
the problem :-). But yet, non-obstant and nevertheless, maybe these few
lines of documentation could be improved.
The "f^(x[i])" bit is somewhat cryptic and not explained. It suggests
that there are as many densities as possible "i" values, and since "i"
indexes "x", it indirectly suggests that length(density) == length(x),
which cannot be right. The "sum[i; ...]" has to be taken up to the
number of cells, not the number of "x" values. Because "x[i]" is a bit
meaningless in the above context, it would be better to avoid it.
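To make the length mismatch concrete, here is a minimal sketch. The
sample "x" and the seed are arbitrary choices of mine, not anything taken
from ?hist; only "hist()" and its "density" and "breaks" components are
real.

```r
set.seed(1)                  # arbitrary seed, only for reproducibility
x <- rnorm(100)              # hypothetical sample of 100 values
h <- hist(x, plot = FALSE)   # compute the histogram without drawing it

n_x     <- length(x)             # number of observations (100 here)
n_cells <- length(h$breaks) - 1  # number of cells
n_dens  <- length(h$density)     # one density value per cell

# density is indexed by cell, not by observation
n_dens == n_cells
n_dens == n_x
```

The first comparison should be TRUE and the second FALSE, which is why
indexing the density by "x[i]" is misleading.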
The "^" may mean that "x[i]" is an index of "f", some kind of TeX device
for shifting the notation. It may also mean "hat" to suggest the
density is an approximation. But the approximation of what? Of course,
I understand an untold model by which "density" estimates the density of
some continuous distribution out of which the "x" values were sampled,
before the "hist()" function was called. But "x" is not necessarily
a sample of a continuum, it may well be the population, and the
densities in the histogram may well be exact, and not an approximation.
So it might be simpler to drop the "^" as well.
The concept of relative frequency is explained only for the case where
all cells have width 1, and not otherwise. This concept is not reused
elsewhere in "?hist". So it is not so useful, and we could use "d"
instead of "f".
Finally, writing "breaks[i+1]-breaks[i]" is simpler and clearer than
introducing an intermediate "b[i]" device. Let's drop it.
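As an illustration of the last two points, the sketch below uses unequal
cell widths. The data and breaks are made up for the example. It checks
that "diff(breaks)" is exactly the vector of "breaks[i+1]-breaks[i]"
values, and that relative frequencies can still be recovered from the
density when the widths are not 1.

```r
x      <- c(0.5, 1.5, 1.7, 2.5, 4, 6, 9)   # hypothetical data
breaks <- c(0, 1, 2, 5, 10)                # hypothetical unequal-width breaks
h      <- hist(x, breaks = breaks, plot = FALSE)

# Cell widths written out with explicit indices ...
i      <- seq_len(length(breaks) - 1)
widths <- breaks[i + 1] - breaks[i]
# ... are the same thing as diff(breaks)
same_widths <- identical(widths, diff(breaks))

# Relative frequencies, and their relation to the density
rel_freq   <- h$counts / length(x)
recovered  <- h$density * widths
same_freqs <- isTRUE(all.equal(rel_freq, recovered))
```

So the density equals 'counts/n' only when every width is 1; otherwise it
is 'counts/n' divided by the cell width.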
Let me suggest a simpler rewriting of these few lines, using humbler
notation while being more precise. Let's start with something like:
density: For each cell i, density[i] is the proportion of all x[]
which get sorted into that cell, divided by the cell width.
So, the value of 'sum(density * diff(breaks))' is 1.
and improve on it.
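The proposed wording can be checked directly. This is only a sketch, with
made-up data and breaks; it sorts the values into cells by hand with
"cut()", assuming the default "hist()" conventions (right-closed cells,
lowest break included), and compares the result with what "hist()"
returns.

```r
x      <- c(0.5, 1.5, 1.7, 2.5, 4, 6, 9)   # hypothetical data
breaks <- c(0, 1, 2, 5, 10)                # hypothetical breaks
h      <- hist(x, breaks = breaks, plot = FALSE)

# Sort each x into its cell, mirroring hist()'s defaults
cell <- cut(x, breaks = breaks, include.lowest = TRUE)

# Proportion of all x[] falling into each cell, divided by the cell width
proportion <- as.vector(table(cell)) / length(x)
by_hand    <- proportion / diff(breaks)

matches <- isTRUE(all.equal(by_hand, h$density))

# The areas of the bars add up to 1
total <- sum(h$density * diff(h$breaks))
```

Here "matches" should be TRUE and "total" should be 1, up to
floating-point rounding.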
--
François Pinard http://pinard.progiciels-bpi.ca