[Rd] nclass.scott() and nclass.FD() {Re: [R] truehist bug?}

Tue Mar 20 10:17:02 CET 2007

[This has become entirely a topic for 'R-devel'; hence I'm
 diverting it there, keeping R-help in the CC just this once.
 Please follow up only to R-devel.]

>>>>> "MM" == Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Tue, 20 Mar 2007 08:49:16 +0100 writes:

>>>>> "Gad" == Gad Abraham <g.abraham at ms.unimelb.edu.au>
>>>>>     on Tue, 20 Mar 2007 17:02:18 +1100 writes:

    Gad> Hi,
    Gad> Is this a bug in truehist()?

    >>> library(MASS)
    >>> x <- rep(1, 10)
    >>> truehist(x)
    Gad> Error in pretty(data, nbins) : invalid 'n' value

    MM> You get something similar though slightly more helpful
    MM> from
    MM>    hist(x, "scott")

    MM> which then uses the same method for determining the number of bins /
    MM> classes for the histogram.

    MM> I'd say the main "bug" is in   
    MM> nclass.scott()   [ and  also nclass.FD() ]

    MM> which do not work when  var(x) == 0  as in this case.
    MM> One could argue that  

    MM> 1) truehist(x) should not use "scott" as
    MM> default when var(x) == 0   {hence a buglet in truehist()}

    MM> and either

    MM> 2) both hist() and truehist() should produce a better error
    MM> message when "scott" (or "FD") is used explicitly and var(x) == 0

    MM> or, rather IMO,

    MM> 3) nclass.scott(x) and nclass.FD(x) should be extended to return a 
    MM> non-negative integer even when  var(x) == 0
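
To make the failure concrete, here is a rough sketch of what happens for
constant data.  It assumes the current rule is the same formula as in the
proposal but without the 'h > 0' guard; 'old_scott' is a made-up name used
only for this illustration, not a function in R.

```r
## Illustration only: 'old_scott' is a hypothetical stand-in for the
## unguarded Scott rule (assumed: same formula, no 'h > 0' check).
old_scott <- function(x)
{
    h <- 3.5 * sqrt(stats::var(x)) * length(x)^(-1/3)
    ceiling(diff(range(x))/h)
}

x <- rep(1, 10)
stats::var(x)    # 0, hence h == 0
diff(range(x))   # 0 as well
old_scott(x)     # 0/0, i.e. NaN
```

The NaN is then handed on as the number of classes, which would explain
the "invalid 'n' value" complaint from pretty().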

After some further thought,
I'm proposing to adopt only '3)'  {'1)' then becomes unneeded},
with the following new code, which is back-compatible for the
case 'h > 0' and does something reasonable for the case 'h == 0':

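nclass.scott <- function(x)
{
    ## Scott's bin width; h == 0 when var(x) == 0
    h <- 3.5 * sqrt(stats::var(x)) * length(x)^(-1/3)
    ## fall back to a single class when the width is degenerate
    if(h > 0) ceiling(diff(range(x))/h) else 1L
}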

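nclass.FD <- function(x)
{
    ## spread estimate: IQR, or a scaled MAD when the IQR is zero
    h <- stats::IQR(x)
    if(h == 0) h <- stats::mad(x, constant = 2) # c=2: consistent with IQR
    ## Freedman-Diaconis bin width is 2 * h * n^(-1/3); one class if h == 0
    if(h > 0) ceiling(diff(range(x))/(2 * h * length(x)^(-1/3))) else 1L
}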
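
A quick check of how the proposal behaves, as a sketch rather than a
proper test suite.  The two functions are copied under the made-up names
'scott_new' and 'fd_new' so that the snippet runs on its own and does not
mask anything; the sample vectors are arbitrary.

```r
## Copies of the proposed definitions under hypothetical names
scott_new <- function(x)
{
    h <- 3.5 * sqrt(stats::var(x)) * length(x)^(-1/3)
    if(h > 0) ceiling(diff(range(x))/h) else 1L
}

fd_new <- function(x)
{
    h <- stats::IQR(x)
    if(h == 0) h <- stats::mad(x, constant = 2)
    if(h > 0) ceiling(diff(range(x))/(2 * h * length(x)^(-1/3))) else 1L
}

x0 <- rep(1, 10)        # constant data, var(x0) == 0
scott_new(x0)           # 1
fd_new(x0)              # 1

x1 <- c(rep(1, 9), 2)   # one differing value; IQR and MAD are both 0
scott_new(x1)           # 2
fd_new(x1)              # 1

x2 <- 1:10              # non-degenerate case, formulas unchanged
scott_new(x2)           # 2
fd_new(x2)              # 3
```

With these, hist(x0, "scott") should get a usable class count instead of
NaN; note that for 'x1' the FD rule still collapses to one class because
the MAD fallback is zero there as well.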


Martin