[R] How hist() decides breaks?

Mon May 19 12:00:10 CEST 2008

(Ted Harding) wrote:
> Hi Folks,
> I'd like to know how hist() decides how many cells to use
> when it ignores my "suggestion" to use say 'hist(...,breaks=50)'.
>
> More specifically, I have the results of 10000 simulations,
> each returning an 8-vector, therefore 8 variables each with
> 10000 values. Some of these 8 have somewhat skew distributions.
> Say one of these 8 variables is X.
>
> I ask for H <- hist(X,breaks=50), and get a histogram which
> usually has a different number of cells than what I intended.
>
> For instance, for one of these simulations, the 8 different
> values of length(H$breaks) are:
>
>   70, 44, 38, 68, 50, 40, 46, 45
>
> ?hist tells me
>
> A)
>   breaks: one of:
>     *  a vector giving the breakpoints between histogram
>        cells,
>     *  a single number giving the number of cells for the
>        histogram,
>     *  a character string naming an algorithm to compute the
>        number of cells (see Details),
>     *  a function to compute the number of cells.
>
>     In the last three cases the number is a suggestion only. 
>
> B)
>   The default for 'breaks' is '"Sturges"': see 'nclass.Sturges'.
>
> If I look at the code for nclass.Sturges() I see
>
>   function (x) ceiling(log2(length(x)) + 1)
>
> and, for length(X) = 10000, this gives 15. This is not related
> to any of the numbers of breaks I actually got, in any way obvious
> to me.
>
> So:
> Question 1: hist() has apparently ignored my "suggestion" of
>   "breaks=50". Why? What is the criterion for ignoring?
>
> Question 2: Presumably, if it ignores the "suggestion", it
>   does something else, of its choice. I would then, perhaps,
>   expect it to fall back to its default, which is (allegedly)
>   Sturges. But the result from nclass.Sturges looks different
>   from what it actually did. So what did it actually do, and
>   how did it decide on this?
>   
No, it is not ignoring you.

Try

hist(rnorm(10000))
length(hist(rnorm(10000),breaks=50)$breaks)

and repeat a dozen times or so. Chances are that you'll mostly see
lengths around 40, but definitely more than the 17 or so that you'll see
without the breaks=50. Next, try

diff(hist(rnorm(10000),breaks=50)$breaks)

and notice that this is usually 0.2, although if you repeat enough
times, you might get a couple of cases with 0.1 and a length of 75(-ish).
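If you'd rather not rerun that by hand, here is a rough sketch of the same experiment in batch form. The seed and the 200 repetitions are arbitrary choices of mine, and plot=FALSE is only there to suppress the plotting; the tallies you get will depend on the seed.

```r
# Repeat the experiment above many times and tally the bin widths.
# The seed and the number of repetitions (200) are arbitrary.
set.seed(1)
res <- replicate(200, {
  h <- hist(rnorm(10000), breaks = 50, plot = FALSE)
  c(width = diff(h$breaks)[1], n_breaks = length(h$breaks))
})

# Round away floating-point noise before tabulating the widths
widths <- round(res["width", ], 10)
print(table(widths))

# Spread of length(H$breaks) over the repetitions
print(range(res["n_breaks", ]))
```

The widths should land on 0.2 almost every time, with 0.1 turning up only when a sample happens to have an unusually narrow range.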

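Get it? Otherwise look at help(pretty), since pretty() is what is doing
the work: hist() takes breaks=50 as a target and rounds the cell width to
a "nice" value that covers the range of the data.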
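To make that concrete, here is a small sketch. It assumes, from my reading of the hist.default source, that a single-number 'breaks' is passed on as pretty(range(x), n = breaks, min.n = 1); check the source of your own R version if it matters. The seed is arbitrary, and the last part is just one way to force exactly 50 cells by supplying the breakpoints as a vector, at the price of "ugly" break values.

```r
set.seed(42)                      # arbitrary seed
x <- rnorm(10000)

# A single number for 'breaks' is only a target handed to pretty()
h <- hist(x, breaks = 50, plot = FALSE)
p <- pretty(range(x), n = 50, min.n = 1)
print(all.equal(h$breaks, p))

# The default goes the same way, with the Sturges count as the target
n_sturges <- nclass.Sturges(x)    # ceiling(log2(10000) + 1)
h0 <- hist(x, plot = FALSE)
p0 <- pretty(range(x), n = n_sturges, min.n = 1)
print(all.equal(h0$breaks, p0))

# To get exactly 50 cells, give the breakpoints explicitly
h_exact <- hist(x, breaks = seq(min(x), max(x), length.out = 51),
                plot = FALSE)
print(length(h_exact$counts))
```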

    -p

> With thanks,
> Ted.
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 19-May-08                                       Time: 10:31:20
> ------------------------------ XFMail ------------------------------

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907