hist()

Martin Maechler <maechler@stat.math.ethz.ch>
Thu, 26 Nov 1998 14:28:12 +0100


>>>>> "BDR" == Prof Brian Ripley <ripley@stats.ox.ac.uk> writes [Mon 16 Nov]:

    >> From: Peter Dalgaard BSA <p.dalgaard@biostat.ku.dk> Date: 16 Nov
    >> 1998 14:26:38 +0100
    >> 
    >> Going over my old notes, I realised that hist() has changed since
    >> the earlier versions of R, in that the intervals are now
    >> left-open,right-closed rather than the opposite. This is a change in
    >> the direction of S-plus compatibility, but I wonder how sensible it
    >> really is.

    BDR> Since a histogram is intended to be of continuous data, it should
    BDR> not really matter. And if the data are already rounded (like ages)
    BDR> it matters how they were rounded. (One of my grandfathers would
    BDR> say he was `in his eighty-second year', so the rounding direction
    BDR> may even depend on the value.)  My example below shows that
    BDR> rounding error can play havoc with one's intentions too.

    BDR> Since cut in R has an argument `right', could not hist have too?
I agree.
    BDR> And I don't actually understand why hist.default does not call
    BDR> cut.default directly, when enhancements to cut will be easy to
    BDR> apply to hist.
Yes, that would be the advantage.
However, the current implementation, which uses .C("bincount",..) instead of
.C("bincode"..), is much less memory-hungry for large x
than table(cut(...)).
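
To make the comparison concrete, here is a small sketch of the two routes to
bin counts. It assumes a current version of R, in which hist() accepts
plot = FALSE and defaults to include.lowest = TRUE; the data and breaks are
made up for illustration.

```r
set.seed(1)
x <- runif(1000)                  # made-up data, strictly inside (0, 1)
breaks <- seq(0, 1, by = 0.1)

# Route 1: hist() counts directly, without building a factor.
h_counts <- hist(x, breaks = breaks, plot = FALSE)$counts

# Route 2: cut() builds a factor as long as x, then table() counts it.
# include.lowest = TRUE matches the hist() default.
c_counts <- as.vector(table(cut(x, breaks, include.lowest = TRUE)))

stopifnot(all(h_counts == c_counts))
```

The counts agree; the only difference is the intermediate factor of
length(x) that the second route has to allocate.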

    BDR> Then we can argue about the default for right. It
    BDR> seems to me that it should be the same for cut and for hist, and I
    BDR> would argue for not changing the current default (whatever it
    BDR> might have been).
The current default is the same for cut & hist, namely "right = TRUE".
I agree that it shouldn't be changed, if only for compatibility.

The help page could/should mention that for many people,
	right = FALSE
seems more natural.
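
A quick sketch of what the choice means for values that sit exactly on a
break, using cut() with made-up ages and decade breaks (both are assumptions
chosen for illustration):

```r
ages   <- c(10, 20, 30)           # made-up values lying exactly on breaks
breaks <- c(0, 10, 20, 30, 40)

# right = TRUE (the default): a value on a break goes into the bin below.
closed_right <- as.character(cut(ages, breaks))
print(closed_right)               # "(0,10]" "(10,20]" "(20,30]"

# right = FALSE: a value on a break goes into the bin above.
closed_left <- as.character(cut(ages, breaks, right = FALSE))
print(closed_left)                # "[10,20)" "[20,30)" "[30,40)"
```

With right = FALSE a 20-year-old lands in the "20 to 40" side of the break,
which is presumably the reading many people expect.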

    BDR> Oh, and what do bincode and bincode2 do about rounding error?
    BDR> (Nothing, I think.) 

You are right: nothing.  [src/appl/binning.c]
However, do you think they *should* do anything at all?
If a 'break' ("cut point") is close to a data value, there is still only one
side of the break on which the data value can fall.
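
A minimal sketch of that point, using the familiar 0.1 + 0.2 rounding
example as a stand-in for a computed break (the numbers are assumptions
chosen for illustration, not taken from binning.c):

```r
b <- 0.1 + 0.2                    # computed break, slightly above 0.3
x <- 0.3                          # data value "on" the break

stopifnot(b != x, x < b)          # the two are not equal in floating point

# Because x < b strictly, x falls in the first bin whichever side is closed.
bin_right <- as.integer(cut(x, c(0, b, 1)))
bin_left  <- as.integer(cut(x, c(0, b, 1), right = FALSE))
print(c(bin_right, bin_left))     # 1 1
```

So the binning itself is unambiguous; what the user cannot control is on
which side of the data value the computed break happens to land.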

    BDR> The default breaks in cut and hist are
    BDR> computed quantities, and as we know, seq does not always get them
    BDR> exactly right.  As in:

    >> cut(c(0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85), 6)
    BDR> [1] (0.249,0.35] (0.35,0.45] (0.45,0.55] (0.45,0.55] (0.55,0.65]
    BDR> [6] (0.65,0.75] (0.75,0.851]

    >> cut(c(0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85), 6, right=FALSE)
    BDR> [1] [0.249,0.35) [0.35,0.45) [0.45,0.55) [0.55,0.65) [0.55,0.65)
    BDR> [6] [0.65,0.75) [0.75,0.851)

I don't think there's any real bug here:

> cut(c(0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85), 6, dig.lab = 5) 
[1] (0.2494,0.3496] (0.3496,0.4498] (0.4498,0.55]   (0.4498,0.55]  
[5] (0.55,0.6502]   (0.6502,0.7504] (0.7504,0.8506]
Levels:  (0.2494,0.3496] (0.3496,0.4498] (0.4498,0.55] (0.55,0.6502] (0.6502,0.7504] (0.7504,0.8506] 

However, for the case of computed breaks (as above),
I think cut() should construct 'breaks' to which no data value is close,
i.e. the ``breaks construction'' should consider rounding problems!
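
As a rough sketch of what such a check could look like: near_breaks() below
is a hypothetical helper, not part of R, that flags data values lying within
a relative tolerance of any break. The break construction (range widened by
1/1000 on each side, then seq()) is an assumption reconstructed from the
dig.lab = 5 output above.

```r
# Hypothetical helper (not part of R): return the data values that lie
# within tol * (range of breaks) of some break.
near_breaks <- function(x, breaks, tol = 1e-8) {
  scale <- diff(range(breaks))
  d <- abs(outer(x, breaks, "-"))  # |x[i] - breaks[j]| for all pairs
  x[apply(d, 1, min) < tol * scale]
}

x  <- c(0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85)
rx <- range(x)
dx <- diff(rx)
# Assumed construction of the 7 breaks for 6 intervals.
breaks <- seq(rx[1] - dx / 1000, rx[2] + dx / 1000, length.out = 7)

flagged <- near_breaks(x, breaks)
print(flagged)                     # 0.55
```

A breaks constructor could use a test of this kind to nudge or recompute its
breaks whenever a data value is flagged; how best to do that is left open
here.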

Thank you for revealing this!


Martin
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._