[Rd] ?hist and $density explanation

Thu May 18 03:20:34 CEST 2006

Hi, people.  Within ?hist (using R 2.3.0), one reads:

 density: values f^(x[i]), as estimated density values. If
          'all(diff(breaks) == 1)', they are the relative frequencies
          'counts/n' and in general satisfy sum[i; f^(x[i])
          (b[i+1]-b[i])] = 1, where b[i] = 'breaks[i]'.

I trip on this explanation each time I read it.  Some R guardians will
be tempted to say that since R itself does not trip, I am necessarily
the problem :-).  Nevertheless, maybe these few lines of documentation
could be improved.

The "f^(x[i])" bit is somewhat cryptic and not explained.  It suggests
that there are as many densities as there are possible "i" values, and
since "i" indexes "x", it indirectly suggests that
length(density) == length(x), which cannot be right.  The "sum[i; ...]"
has to be taken over the cells, not over the "x" values.  Because
"x[i]" is rather meaningless in this context, it would be better to
avoid it.
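
To make the length mismatch concrete, here is a minimal sketch with
made-up data (the values of "x" and the breaks are only assumptions
chosen for illustration).  It shows that "density" has one entry per
cell, not one per observation:

```r
# Eight made-up observations, sorted into four cells of unequal width.
x <- c(1.2, 2.5, 2.7, 3.1, 4.8, 5.0, 7.3, 9.9)
h <- hist(x, breaks = c(0, 2, 4, 6, 10), plot = FALSE)

n_obs   <- length(x)              # number of observations
n_dens  <- length(h$density)      # number of density values
n_cells <- length(h$breaks) - 1   # number of cells

# One density per cell, so n_dens equals n_cells and differs from n_obs.
print(c(observations = n_obs, densities = n_dens, cells = n_cells))
```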

The "^" may mean that "x[i]" is an index of "f", written with some kind
of TeX device for shifting the notation.  It may also mean "hat", to
suggest that the density is an estimate.  But an estimate of what?  Of
course, I understand the unstated model by which "density" estimates the
density of some continuous distribution from which the "x" values were
sampled before the "hist()" function was called.  But "x" is not
necessarily a sample from a continuum; it may well be the whole
population, in which case the densities in the histogram are exact and
not an approximation.  So it might be simpler to drop the "^" as well.

The concept of relative frequency is explained only for the case of
unit-width cells, and not otherwise.  This concept is not reused
elsewhere in "?hist".  So it is not very useful, and we could use "d"
instead of "f".

Finally, writing "breaks[i+1]-breaks[i]" is simpler and clearer than 
introducing an intermediate "b[i]" device.  Let's drop it.
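
As a rough illustration of the relative-frequency remark and of using
diff(breaks) directly, here is a small sketch with made-up data (the
values and breaks are assumptions, not taken from "?hist").  With cells
of width 1 the densities coincide with counts/n; with wider cells they
are counts/n divided by the cell width:

```r
# Five made-up observations.
x <- c(0.5, 1.5, 1.7, 2.2, 4.9)

# Cells of width 1: density equals the relative frequencies counts/n.
h1 <- hist(x, breaks = 0:5, plot = FALSE)
unit_ok <- isTRUE(all.equal(h1$density, h1$counts / length(x)))

# Cells of width 2: density is the relative frequency divided by the
# cell width, which diff(breaks) gives without any intermediate b[i].
h2 <- hist(x, breaks = c(0, 2, 4, 6), plot = FALSE)
rel_freq <- h2$counts / length(x)
wide_ok <- isTRUE(all.equal(h2$density, rel_freq / diff(h2$breaks)))

print(c(unit_width = unit_ok, width_two = wide_ok))
```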

Let me suggest a simpler rewriting of these few lines, using humbler 
notation while being more precise.  Let's start with something like:

 density: For each cell i, density[i] is the proportion of all x[]
          which get sorted into that cell, divided by the cell width.
          So, the value of 'sum(density * diff(breaks))' is 1.

and improve on it.
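
The suggested wording can be checked against what hist() returns.  The
following is only a sketch on made-up data (the values and the unequal
breaks are assumptions for illustration): it recomputes the density as
the proportion per cell divided by the cell width, and verifies that
sum(density * diff(breaks)) is 1.

```r
# Eight made-up observations and four cells of unequal width.
x <- c(1.2, 2.5, 2.7, 3.1, 4.8, 5.0, 7.3, 9.9)
h <- hist(x, breaks = c(0, 2, 4, 6, 10), plot = FALSE)

widths     <- diff(h$breaks)           # cell widths
proportion <- h$counts / length(x)     # share of x[] in each cell
d          <- proportion / widths      # density as worded above

same_density <- isTRUE(all.equal(d, h$density))
total        <- sum(h$density * widths)

print(same_density)
print(total)
```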

-- 
François Pinard   http://pinard.progiciels-bpi.ca