[Rd] suggestion for extending ?as.factor

Wed May 6 10:41:58 CEST 2009

>>>>> "MM" == Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Tue, 5 May 2009 10:35:42 +0200 writes:

>>>>> "PD" == Peter Dalgaard <P.Dalgaard at biostat.ku.dk>
>>>>>     on Mon, 04 May 2009 19:28:06 +0200 writes:

    PD> Petr Savicky wrote:
    >>> On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote:
    >>> [snip]
    >>>> Let me quickly expand the tasks we have wanted to address, when
    >>>> I started changing factor() for R-devel.
    >>>> 
    >>>> 1) R-core had unanimously decided that R 2.10.0 should not allow
    >>>> duplicated levels in factors anymore.
    >>>> 
    >>>> When working on that, I had realized that quite a few bits of code
    >>>> were implicitly relying on duplicated levels (or something
    >>>> related), see below, so the current version of R-devel only
    >>>> *warns* in some cases where duplicated levels are produced
    >>>> instead of giving an error.
    >>>> 
    >>>> What I had also found was that basically, even our own (!) code
    >>>> and quite a bit of user code has more or less relied on other
    >>>> things that were not true (even though "almost always" fulfilled):
    >>>> 
    >>>> 2) if x contains no duplicated values, then factor(x) should contain none either
    >>>> 
    >>>> 3) factor(x) constructs a factor object with *unique* levels
    >>>> 
    >>>> {This is what our decision "1)" implies and now enforces}
    >>>> 
    >>>> 4) as.numeric(names(table(x))) should be  identical to unique(x)
    >>>> 
    >>>> where "4)" is basically ensured by "3)" as table() calls
    >>>> factor() for non-factor args.
    >>>> 
    >>>> As mentioned the bad thing is that "2) - 4)" are typically
    >>>> fulfilled in all tests package writers would use.
    >>>> 
    >>>> Concerning '3)' [and '1)'], as you know, inside R-core we have
    >>>> proposed to at least ensure that  `levels<-` 
    >>>> should not allow duplicated levels, 
    >>>> and I had concluded that
    >>>> a) factor() really should use  `levels<-` instead of the low-level
    >>>> attr(., "levels") <- ....
    >>>> b) factor() itself must make sure that the default levels become unique.
    >>>> 
    >>>> ---
    >>>> 
    >>>> Given Petr's (and more) examples and the strong requirement of
    >>>> "user convenience" and back-compatibility,
    >>>> I now tend to agree (with Peter) that we cannot ensure all of 2)
    >>>> and 4) and still allow factor() to behave as it did for "rounded
    >>>> decimal numbers",
    >>>> and consequently would have to continue not to ensure
    >>>> properties (2) and (4).
    >>>> Something quite unfortunate, since, as I said, much useR code
    >>>> implicitly relies on these, and so that code is buggy even
    >>>> though the bug will only show in exceptional cases.
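
To make the fragility of 4) concrete, here is a small sketch (an illustration only, not code from the R sources). It assumes a version of R in which factor() labels numeric values with about 15 significant digits, so two doubles that differ only in their last bits end up sharing one label:

```r
# Two doubles that print alike but are not bitwise identical
x <- c(0.3, 0.1 + 0.2)

# unique() compares the exact binary values
n_unique <- length(unique(x))

# table() goes through factor(), which labels values via character strings
n_table <- length(as.numeric(names(table(x))))

# Property 4) would require these two counts to agree
c(unique = n_unique, table = n_table)
```

Code that silently assumes the two counts agree is exactly the kind of useR code meant above.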

[................................]    

     PD> I think that the real issue is that we actually do want almost-equal
     PD> numbers to be folded together. 

yes, this now (revision 48469) will happen by default, using  signif(x, 15) 
where '15' is the default for the new optional argument 'digitsLabels'
{better argument name? (but must not start with 'label')}

Why '15': Because this is most back-compatible and sufficient to
          solve simple arithmetic (0.1 + 0.2) issues.
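
A minimal sketch of the arithmetic case meant here, using only base R (nothing in it is specific to revision 48469):

```r
a <- 0.1 + 0.2
b <- 0.3

# The two doubles differ in their last bits ...
exact_equal <- identical(a, b)

# ... but agree once both are rounded to 15 significant digits
folded_equal <- identical(signif(a, 15), signif(b, 15))

c(exact = exact_equal, folded = folded_equal)
```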

    MM> in most cases, but not all {*when*  levels is not specified},
    MM> but useR's code sometimes *does* rely on  factor()  /  table()
    MM> using exact values.

    MM> Also, what should happen when the user explicitly calls

    MM> factor(x, levels = sort(unique(x)))

    MM> at least in that case we really should *not* fold almost equals.
    MM> and the "old" code (<= R 2.9.0) did fold them in border cases,
    MM> and led to non-unique levels.

    MM> Can we agree that any rounding etc - if needed - will only
    MM> happen when
    MM> 1) missing(levels)
    MM> 2) is.numeric(x) || is.complex(x)

The code I've committed (revision 48469) now does that.
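
The two conditions can be sketched roughly as follows. This is NOT the committed code: factor_sketch() and its 'digits' argument are hypothetical names used only for illustration, and the real factor() has more arguments and more careful label handling.

```r
# Hypothetical illustration of the rule: round only when 'levels' is
# missing and x is numeric or complex; otherwise leave x untouched.
factor_sketch <- function(x, levels, digits = 15) {
  if (missing(levels)) {
    if (is.numeric(x) || is.complex(x)) {
      x <- signif(x, digits)   # fold almost-equal numbers together
    }
    levels <- sort(unique(x))  # default levels are now unique
  }
  # With user-supplied levels nothing is rounded
  factor(x, levels = levels)
}

f_num <- factor_sketch(c(0.1 + 0.2, 0.3, 0.5))
f_chr <- factor_sketch(c("b", "a", "b"))
nlevels(f_num)
levels(f_chr)
```

Under these assumptions, 0.1 + 0.2 and 0.3 share a level, while character input is passed through unchanged.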

    MM> I'm also thinking of at least keeping the current behavior as an
    MM> option, e.g. by  factor(x, ...., keepUniqueness = TRUE, ....)
    MM> where the default would be keepUniqueness = FALSE.

current argument name is 'keepUnique'.

    PD> The most relevant case I can conjure up is this (permutation testing):

    >>> zz <- replicate(20000,sum(sample(sleep$extra,10)))
    >>> length(table(zz))
    PD> [1] 427
    >>> length(table(signif(zz,7)))
    PD> [1] 281

    PD> Notice that the discrepancy comes from sums that really are identical
    PD> values (in decimal arithmetic), but where the binary FP inaccuracy makes
    PD> them slightly different.

    MM> Yes, that's a good example.

    MM> However, I now think it would be helpful to slightly separate
    MM> the issue of what factor() should do from
    MM> how table() should call factor() in those cases it does.

I still believe that.
Currently,  table()  calls    " factor(a, exclude = exclude) "
when 'a' is not a factor, e.g., when it is numeric.
I propose that  table() should also gain some of the new
optional factor() arguments, and maybe even use a different
default than 15.

Note that the new R-devel now gives

> set.seed(7); zz <- replicate(20000,sum(sample(sleep$extra,10)))
> length(tz <- table(zz))
[1] 283

whereas R <= 2.9.0 gives
....
[1] 422

so that at least for this example, '15' is good enough,
i.e., '7' is not needed. As mentioned above, the advantage of
'15' is that it is much closer to previous R (and S+ !) behavior
than a smaller value.
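
For those without a current R-devel, the comparison can be approximated in any R version by applying signif() by hand. This is only a sketch: it counts distinct values directly instead of going through table(), and I give no counts here because they depend on the sample() algorithm of the R version used.

```r
set.seed(7)
zz <- replicate(20000, sum(sample(sleep$extra, 10)))

# Distinct values at full precision, and after rounding to 15 and 7 digits
n_exact <- length(unique(zz))
n_15    <- length(unique(signif(zz, 15)))
n_7     <- length(unique(signif(zz, 7)))

c(exact = n_exact, digits15 = n_15, digits7 = n_7)
```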

Martin