[Rd] Default for bin limits in hist()

Jose G Conde Santiago jose.conde1 at upr.edu
Wed Nov 8 22:37:42 CET 2017


Hello all.

I noticed that hist() builds its bins with “right = TRUE” by default, so each bin is open on the left and closed on the right.

I think “right=FALSE” would be more consistent with usual definitions of lower and upper limits for bins in applied statistics, and I suggest that you consider making it the default for hist().

For example, I generated the following frequency distribution for duration of hospitalization with a script in R specifying the cuts to be “right = FALSE” (from an exercise in Bernard Rosner’s Fundamentals of Biostatistics book).  

bin        number   proportion
[0,5)         5        0.20
[5,10)       12        0.48
[10,15)       6        0.24
[15,20)       1        0.04
[20,25)       0        0.00
[25,30]       1        0.04

For durations recorded in whole days, the effective boundaries of the bins are 0-4, 5-9, 10-14, … and so on, since the limits on the right are “open”, with the exception of the last bin. This format agrees with usual practice and recommendations. In fact, it is compatible with the process described by Rosner in his book (“from y inclusive to y exclusive”).
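A minimal sketch of how such a table can be produced with cut(). The vector `dur` is an illustrative set of 25 whole-day durations chosen so that it reproduces the counts in the table above; it is an assumption for this example, not necessarily the data from the book exercise.

```r
# Illustrative data (assumption): 25 durations in whole days.
dur <- c(5, 10, 6, 11, 5, 14, 30, 11, 17, 3, 9, 3, 8,
         8, 5, 5, 7, 4, 3, 7, 9, 11, 11, 9, 4)

# Bin edges 0, 5, ..., 30 give six bins of width 5.
breaks <- seq(0, 30, by = 5)

# right = FALSE makes each bin closed on the left and open on the right;
# include.lowest = TRUE closes the last bin so the maximum (30) is kept.
bins <- cut(dur, breaks = breaks, right = FALSE, include.lowest = TRUE)

freq <- table(bins)
freq_table <- data.frame(
  number     = as.vector(freq),
  proportion = as.vector(prop.table(freq)),
  row.names  = names(freq)
)
print(freq_table)
```

With these settings the bin labels come out as [0,5), [5,10), … with the last one written as [25,30].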

If I use R to generate a histogram with 6 bins, I get the following:

[Attachment scrubbed by the list archive: histogram1.pdf, application/pdf, 4457 bytes]
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20171108/8db83513/attachment.pdf>

… which actually presents the histogram of the frequency distribution when the “right” parameter is set as “TRUE”: 


bin        number   proportion
[0,5]         9        0.36
(5,10]        9        0.36
(10,15]       5        0.20
(15,20]       1        0.04
(20,25]       0        0.00
(25,30]       1        0.04

In this case, the real limits of the bins are 0-5, 6-10, 11-15, … and so on.

If I edit the histogram command adding “right = FALSE”, I can get the histogram for my original frequency distribution. Compare bins 1 and 2 in both distributions and histograms.


[Attachment scrubbed by the list archive: Histogram2.pdf, application/pdf, 4481 bytes]
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20171108/8db83513/attachment-0001.pdf>
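The difference between the two settings can be checked without drawing anything by asking hist() for its counts only. This is a sketch under the same assumption as before: `dur` is an illustrative vector of 25 whole-day durations that matches the tabulated counts, not necessarily the original data.

```r
# Illustrative data (assumption): 25 durations in whole days.
dur <- c(5, 10, 6, 11, 5, 14, 30, 11, 17, 3, 9, 3, 8,
         8, 5, 5, 7, 4, 3, 7, 9, 11, 11, 9, 4)
breaks <- seq(0, 30, by = 5)

# Default behaviour: right = TRUE, bins are (a, b], first bin is [0, 5].
counts_right <- hist(dur, breaks = breaks, plot = FALSE)$counts

# Alternative: right = FALSE, bins are [a, b), last bin is [25, 30].
counts_left <- hist(dur, breaks = breaks, right = FALSE, plot = FALSE)$counts

# Side-by-side comparison; only the observations equal to 5, 10, ...
# (values that sit exactly on a bin edge) move between bins.
comparison <- rbind("right = TRUE" = counts_right, "right = FALSE" = counts_left)
print(comparison)
```

Only the values that fall exactly on a break change bins, which is why the two settings differ most for integer-valued data such as days or ages.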



The value of the “right” parameter may be a matter of preference, but I think most users of R would benefit from bins that are closed on the left and open on the right, and therefore from having this setting as the default for hist().
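Until any such change is made, a user who prefers left-closed bins can get them with a small wrapper. This is only a sketch: `hist_left` is a hypothetical name introduced here, not a function in R, and the data vector is the same illustrative assumption used for the tables above.

```r
# Hypothetical convenience wrapper (not part of R): same as hist(),
# but with right = FALSE unless the caller says otherwise.
hist_left <- function(x, ..., right = FALSE) {
  graphics::hist(x, ..., right = right)
}

# Illustrative data (assumption): 25 durations in whole days.
dur <- c(5, 10, 6, 11, 5, 14, 30, 11, 17, 3, 9, 3, 8,
         8, 5, 5, 7, 4, 3, 7, 9, 11, 11, 9, 4)

# plot = FALSE returns the histogram object without opening a device.
h <- hist_left(dur, breaks = seq(0, 30, by = 5), plot = FALSE)
print(h$counts)
```

All other arguments are passed through unchanged, so existing calls to hist() only need the function name replaced.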

I am aware I am writing from the limited perspective of my own field (epidemiology and biostatistics), but there are plenty of examples that show the need to consider changing the default. Here are just a few:

https://www.statcan.gc.ca/eng/concepts/definitions/age2

https://seer.cancer.gov/stdpopulations/stdpop.19ages.html

https://www.census.gov/data/tables/time-series/demo/income-poverty/cps-hinc/hinc-01.html


Thank you.

José 

José G. Conde, MD, MPH
Professor, School of Medicine
Director, CentIT2
UPR Medical Sciences Campus 

Tel  (787) 763-9401 Fax (787) 758-5206

Email: jose.conde1 at upr.edu

URL: http://rcmi.rcm.upr.edu
