[Rd] suggestion for extending ?as.factor

Mon May 4 18:34:03 CEST 2009

On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote:
[snip]
> Let me quickly expand the tasks we have wanted to address, when
> I started changing factor() for R-devel.
> 
> 1) R-core had unanimously decided that R 2.10.0 should not allow
>    duplicated levels in factors anymore.
> 
> When working on that, I had realized that quite a few bits of code
> were implicitly relying on duplicated levels (or something
> related), see below, so the current version of R-devel only
> *warns* in some cases where duplicated levels are produced
> instead of giving an error.
> 
> What I had also found was that basically, even our own (!) code
> and quite a bit of user code has more or less relied on other
> things that were not true (even though "almost always" fulfilled):
> 
> 2) if x contains no duplicated values, then  factor(x) should neither
> 
> 3) factor(x) constructs a factor object with *unique* levels
> 
>   {This is what our decision "1)" implies and now enforces}
> 
> 4) as.numeric(names(table(x))) should be  identical to unique(x)
> 
>   where "4)" is basically ensured by "3)" as table() calls
>   factor() for non-factor args.
> 
> As mentioned the bad thing is that "2) - 4)" are typically
> fulfilled in all tests package writers would use.
> 
> Concerning '3)' [and '1)'], as you know, inside R-core we have
> proposed to at least ensure that  `levels<-` 
> should not allow duplicated levels, 
> and I had concluded that
> a) factor() really should use  `levels<-` instead of the low-level	
>    attr(., "levels") <- ....
> b) factor() itself must make sure that the default levels became unique.
> 
> ---
> 
> Given Petr's (and more) examples and the strong requirement of
> "user convenience" and back-compatibility,
> I now tend to agree (with Peter) that we cannot ensure all of 2)
> and 4) and still allow factor() to behave as it did for "rounded
> decimal numbers",
> and consequently we would have to continue not ensuring
> properties (2) and (4).
> Something quite unfortunate, since, as I said, much useR code
> implicitly relies on these, and so that code is buggy even
> though the bug will only show in exceptional cases.
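
To make the quoted concern concrete, here is a minimal sketch of a case where properties 2) and 4) fail. The input vector is my own illustrative choice, and the exact behaviour of factor() with respect to duplicated levels depends on the R version, so treat this as an illustration rather than a specification:

```r
# Two doubles that differ in the last bits but share a 15-digit representation.
x <- c(0.3, 0.1 + 0.2)

length(unique(x))   # 2: the values are distinct as doubles
as.character(x)     # both elements are "0.3"

# factor() derives its default levels from as.character(), so the two
# distinct values end up with the same label.
f <- factor(x)
levels(f)

# Property 4) fails: the labels no longer identify the original values.
identical(as.numeric(names(table(x))), unique(x))
```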

Let me suggest also considering the following algorithm: for each
number separately, determine the smallest number of significant digits
needed to preserve the double value exactly. An R code prototype
demonstrating the algorithm could be as follows:

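  convert <- function(x) # x should be a single number
  {
      # try increasing precision until the string converts back to x exactly
      for (d in 1:16) {
          y <- sprintf(paste("%.", d, "g", sep=""), x)
          if (x == as.numeric(y)) {
              return(y)
          }
      }
      # fall back to 17 significant digits
      return(sprintf("%.17g", x))
  }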

For this, we get

  > convert(0.3)
  [1] "0.3"
  > convert(1/3)
  [1] "0.3333333333333333" # 16 digits suffice
  > convert(0.12345)
  [1] "0.12345"
  > convert(0.12345678901234567)
  [1] "0.12345678901234566"
  > 0.12345678901234567 == as.numeric("0.12345678901234566")
  [1] TRUE

This algorithm is slower than a single call to sprintf("%.17g", x), but it
produces shorter, nicer strings where possible, and guarantees that the
value is always preserved.
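
As a rough sketch of how this could be used for labels, the prototype can be applied elementwise to a vector. The wrapper name convert_all and the test vector are hypothetical and only for illustration; this is not a proposal for how factor() itself should be changed:

```r
# convert() repeated from the prototype above so that the block is self-contained.
convert <- function(x) # x should be a single number
{
    for (d in 1:16) {
        y <- sprintf(paste("%.", d, "g", sep=""), x)
        if (x == as.numeric(y)) {
            return(y)
        }
    }
    return(sprintf("%.17g", x))
}

# Hypothetical helper: apply convert() to each element of a numeric vector.
convert_all <- function(x) vapply(x, convert, character(1))

x <- c(0.3, 0.1 + 0.2, 1/3)
labs <- convert_all(x)
labs

# Distinct doubles get distinct labels, so the levels are unique
# without merging different values.
f <- factor(labs, levels = unique(labs))
nlevels(f) == length(unique(x))
```

Since convert() is called once per element and may call sprintf() up to 17 times, this is noticeably slower than a vectorized sprintf("%.17g", x) on long vectors.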

Petr.