[Rd] suggestion for extending ?as.factor

Peter Dalgaard P.Dalgaard at biostat.ku.dk
Mon May 4 19:28:06 CEST 2009


Petr Savicky wrote:
> On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote:
> [snip]
>> Let me quickly expand the tasks we have wanted to address, when
>> I started changing factor() for R-devel.
>>
>> 1) R-core had unanimously decided that R 2.10.0 should not allow
>>    duplicated levels in factors anymore.
>>
>> When working on that, I had realized that quite a few bits of code
>> were implicitly relying on duplicated levels (or something
>> related), see below, so the current version of R-devel only
>> *warns* in some cases where duplicated levels are produced
>> instead of giving an error.
>>
>> What I had also found was that basically, even our own (!) code
>> and quite a bit of user code has more or less relied on other
>> things that were not true (even though "almost always" fulfilled):
>>
>> 2) if x contains no duplicated values, then  factor(x) should neither
>>
>> 3) factor(x) constructs a factor object with *unique* levels
>>
>>   {This is what our decision "1)" implies and now enforces}
>>
>> 4) as.numeric(names(table(x))) should be  identical to unique(x)
>>
>>   where "4)" is basically ensured by "3)" as table() calls
>>   factor() for non-factor args.
>>
>> As mentioned the bad thing is that "2) - 4)" are typically
>> fulfilled in all tests package writers would use.
>>
>> Concerning '3)' [and '1)'], as you know, inside R-core we have
>> proposed to at least ensure that  `levels<-` 
>> should not allow duplicated levels, 
>> and I had concluded that
>> a) factor() really should use  `levels<-` instead of the low-level	
>>    attr(., "levels") <- ....
>> b) factor() itself must make sure that the default levels became unique.
>>
>> ---
>>
>> Given Petr's (and more) examples and the strong requirement of
>> "user convenience" and back-compatibility,
>> I now tend to agree (with Peter) that we cannot ensure all of 2)
>> and 4) still allow factor() to behave as it did for "rounded
>> decimal numbers",
>> and consequently would have to (continue to) not ensuring
>> properties (2) and (4).
>> Something quite unfortunate, since, as I said, much useR code
>> implicitly relies on these, and so that code is buggy even
>> though the bug will only show in exceptional cases.
> 
> Let me suggest to consider also the following algorithm: determine
> the number of digits needed to preserve the double value exactly for
> each number separately. An R code prototype demonstrating the 
> algorithm could be as follows
> 
>   convert <- function(x) # x should be a single number
>   {
>       for (d in 1:16) {
>           y <- sprintf(paste("%.", d, "g", sep=""), x)
>           if (x == as.numeric(y)) {
>               return(y)
>           }
>       }
>       return(sprintf("%.17g", x))
>   }
> 
> For this, we get
> 
>   > convert(0.3)
>   [1] "0.3"
>   > convert(1/3)
>   [1] "0.3333333333333333" # 16 digits suffice
>   > convert(0.12345)
>   [1] "0.12345"
>   > convert(0.12345678901234567)
>   [1] "0.12345678901234566"
>   > 0.12345678901234567 == as.numeric("0.12345678901234566")
>   [1] TRUE
> 
> This algorithm is slower than a single call to sprintf("%.17g", x), but it
> produces nicer numbers, if possible, and guarantees that the value is
> always preserved.

Yes, but....

> convert(0.2+0.1)
[1] "0.30000000000000004"


I think that the real issue is that we actually do want almost-equal
numbers to be folded together. The most relevant case I can conjure up
is this (permutation testing):

> zz <- replicate(20000,sum(sample(sleep$extra,10)))
> length(table(zz))
[1] 427
> length(table(signif(zz,7)))
[1] 281

Notice that the discrepancy comes from sums that really are identical
values (in decimal arithmetic), but where the binary FP inaccuracy makes
them slightly different.

[for a nice picture, continue the example with

> tt <- table(signif(zz,7))
> plot(as.numeric(names(tt)),tt, type="h")

]

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907



More information about the R-devel mailing list