[R] histogram first bar wrong position

Martin Maechler maechler at stat.math.ethz.ch
Fri Dec 23 13:01:16 CET 2016


>>>>> William Dunlap <wdunlap at tibco.com>
>>>>>     on Thu, 22 Dec 2016 09:08:35 -0800 writes:

    > As a practical matter, 'continuous' data must be discretized, so if you
    > have long vectors of it you will run into this problem.

    > Bill Dunlap
    > TIBCO Software
    > wdunlap tibco.com

Yes, it is true that on the computer and in statistics we never have
continuous data in the strict sense.

My point was, and still is, that a histogram is the wrong graphical
tool for visualizing a distribution
on a small finite set, e.g., the dice rolls 'itpro' has used.

And yes, if (s)he used something like

   dice <- ceiling(6 * runif(100))

and really prefers to use  hist() over (something like)

   plot(table(dice), lwd = 6)

then an appropriate graphic would rather be

  hist(dice, freq=TRUE, col="orange", breaks = (31:(6*32))/32)

(and the default breaks for sample size N = 100'000 are indeed
 relatively close to that because, as we both know, the number of
 default breaks grows (slowly) with N).
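
As a rough check of the two points above (a sketch, not code from the
original thread): hist() with plot = FALSE returns the bin counts without
drawing, so one can confirm that the fine breaks put each face of the die
into its own narrow bin, and that the default number of classes, as
computed by nclass.Sturges(), grows only logarithmically with N. The seed
and the variable names are arbitrary choices.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
dice <- ceiling(6 * runif(100))

## Breaks of width 1/32 from 31/32 to 6: each integer 1..6 is the right
## end point of its own narrow bin, so the bars end up evenly spaced.
fine <- hist(dice, breaks = (31:(6*32))/32, plot = FALSE)
nonzero <- fine$counts[fine$counts > 0]
tab <- as.vector(table(dice))        # counts per face, for comparison

## Default number of classes (Sturges): ceiling(log2(N) + 1)
k100 <- nclass.Sturges(dice)                        # N = 100
k1e5 <- nclass.Sturges(ceiling(6 * runif(100000)))  # N = 100'000
```

Note that hist() passes this class count on to pretty(), so the breaks
actually used need not match it exactly; hist(dice, plot = FALSE)$breaks
shows what was chosen.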

For me, histograms are a (poor but easy to understand and
explain) version of density estimates (where the underlying
density is with respect to the Lebesgue measure or similar).
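
To illustrate the density-estimate view (a minimal sketch with simulated
data; the seed and names are arbitrary): on the density scale the bar
areas of a histogram sum to one, just like a kernel density estimate
integrates to (approximately) one.

```r
set.seed(2)                       # arbitrary seed
x <- rnorm(1000)                  # continuous data, where hist() is appropriate

h <- hist(x, plot = FALSE)        # bin computation only, no drawing
area <- sum(h$density * diff(h$breaks))   # bar heights times bar widths

d <- density(x)                   # kernel density estimate of the same data

## To compare the two visually:
## hist(x, freq = FALSE); lines(d, lwd = 2)
```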

Now back to large / long vectors of data:
If you need to bin large vectors, you will hopefully be binning
into 100's or 1000's of bins (because 1000 is still much
smaller than "large"), and then you have actually computed the
data for a histogram yourself already; so I personally would
again prefer not to use hist(), but to write my own "3 line"
function that returns a "histogram" object which I'd call plot(.) on.

So, maybe providing such a short function may be useful, notably
on the ?hist help page?
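
One possible shape for such a function, as a sketch only: asHistogram()
is a hypothetical name, not an existing R function, and it is a bit
longer than three lines because of the argument check. It fills in the
components that plot() expects from an object of class "histogram"
(breaks, counts, density, mids, xname, equidist). The binning via
findInterval() and tabulate() is just one way to get the counts, and it
uses left-closed bins, unlike the hist() default.

```r
## Hypothetical helper (not part of R): build a "histogram" object from
## bin counts that were computed elsewhere.
asHistogram <- function(breaks, counts, xname = "x") {
    stopifnot(length(breaks) == length(counts) + 1L, !is.unsorted(breaks))
    widths <- diff(breaks)
    structure(list(breaks = breaks, counts = counts,
                   density = counts / (sum(counts) * widths),
                   mids = breaks[-1L] - widths / 2,
                   xname = xname,
                   equidist = diff(range(widths)) < 1e-7 * mean(widths)),
              class = "histogram")
}

## Example: bin a long vector with findInterval() + tabulate(), then wrap.
set.seed(3)                                      # arbitrary seed
x <- runif(1e6)
breaks <- seq(0, 1, length.out = 101)            # 100 equal-width bins
counts <- tabulate(findInterval(x, breaks, rightmost.closed = TRUE),
                   nbins = length(breaks) - 1L)
h <- asHistogram(breaks, counts, xname = "x")
## plot(h)   # dispatches to the plot method for "histogram" objects
```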

Martin Maechler,
ETH Zurich


    > On Thu, Dec 22, 2016 at 8:19 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:

    >> >>>>> itpro  <itpro1 at yandex.ru>
    >> >>>>>     on Thu, 22 Dec 2016 16:17:28 +0300 writes:
    >> 
    >> > Hi, everyone.
    >> > I stumbled upon weird histogram behaviour.
    >> 
    >> > Consider this "dice emulator":
    >> > Step 1: Generate uniform random array x of size N.
    >> > Step 2: Multiply each item by six and round up to the next integer to get numbers 1 to 6.
    >> > Step 3: Plot histogram.
    >> 
    >> >> x<-runif(N)
    >> >> y<-ceiling(x*6)
    >> >> hist(y,freq=TRUE, col='orange')
    >> 
    >> 
    >> > Now what I get with N=100000
    >> 
    >> >> x<-runif(100000)
    >> >> y<-ceiling(x*6)
    >> >> hist(y,freq=TRUE, col='green')
    >> 
    >> > At first glance looks OK.
    >> 
    >> > Now try N=100
    >> 
    >> >> x<-runif(100)
    >> >> y<-ceiling(x*6)
    >> >> hist(y,freq=TRUE, col='red')
    >> 
    >> > Now the first bar is not where it should be.
    >> > Hmm. Look again at the 100000 histogram... The first bar is not where I want it; it's only less striking due to the narrow bars.
    >> 
    >> > So, the first bar is always in the wrong position. How do I fix it to get perfectly spaced bars?
    >> 
    >> Don't use histograms *at all* for such discrete integer data.
    >> 
    >> N <- rpois(100, 5)
    >> plot(table(N), lwd = 4)
    >> 
    >> Histograms should only be used for continuous data (or discrete data
    >> with "many" possible values).
    >> 
    >> It's a pain to see them so often "misused" for data like the 'N' above.
    >> 
    >> Martin Maechler,
    >> ETH Zurich
    >> 
