[Rd] suggestion for extending ?as.factor

Fri May 8 15:18:01 CEST 2009

>>>>> "PS" == Petr Savicky <savicky at cs.cas.cz>
>>>>>     on Fri, 8 May 2009 11:01:55 +0200 writes:

    PS> On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote:
    PD> I think that the real issue is that we actually do want almost-equal
    PD> numbers to be folded together. 
    >> 
    >> yes, this now (revision 48469) will happen by default, using  signif(x, 15) 
    >> where '15' is the default for the new optional argument 'digitsLabels'
    >> {better argument name? (but must not start with 'label')}

    PS> Let me analyze the current behavior of factor(x) for numeric x with missing(levels)
    PS> and missing(labels). In this situation, levels are computed as sort(unique(x))
    PS> from possibly transformed x. Then, labels are constructed by a conversion of the
    PS> levels to strings.

    PS> I understand the current (R 2.10.0, 2009-05-07 r48492) behavior as follows.

    PS> If keepUnique is FALSE (the default), then
    PS> - values x are transformed by signif(x, digitsLabels)
    PS> - labels are computed using as.character(levels)
    PS> - digitsLabels defaults to 15, but may be set to any integer value

    PS> If keepUnique is TRUE, then
    PS> - values x are preserved
    PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
    PS> - digitsLabels defaults to 17, but may be set to any integer value

(in theory; in practice, I think I've suggested somewhere that
 it should be  >= 17;  but see below.)

Your summary seems correct to me.
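
For readers without the development version at hand, the two modes
summarized above can be emulated with plain base R calls. This is only
a sketch: 'keepUnique' and 'digitsLabels' are the experimental arguments
under discussion, so they are not passed to factor() here, and the
variable names are made up for illustration.

```r
# Sketch only: emulate the two labelling modes with base R functions.
x <- c(0.3, 0.1 + 0.2, 1/3)

# keepUnique = FALSE: fold almost-equal numbers first, then build labels
lev_folded <- sort(unique(signif(x, 15)))
lab_folded <- as.character(lev_folded)

# keepUnique = TRUE: keep every distinct double, label with 17 digits
lev_exact <- sort(unique(x))
lab_exact <- sprintf("%.*g", 17L, lev_exact)

lab_folded
lab_exact
```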

    PS> There are several situations in which this approach produces duplicated levels.
    PS> Besides the one described in my previous email, there are also others, e.g.
    PS> factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15)

yes, but this does not make much sense; I've already contemplated
producing a warning in such cases, something like

   if(keepUnique && digitsLabels < 17)
     warning(gettextf(
     "'digitsLabels = %d' is typically too small when 'keepUnique' is true",
     digitsLabels))
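
Why 15 digits are too few once the values are kept unique can be seen
without the new arguments at all. The sketch below only uses sprintf()
and assumes nothing beyond IEEE double arithmetic:

```r
x <- c(0.3, 0.1 + 0.2)
x[1] == x[2]                    # compares the two doubles; they differ

lab15 <- sprintf("%.15g", x)    # 15 significant digits
lab17 <- sprintf("%.17g", x)    # 17 significant digits

anyDuplicated(lab15) > 0        # do the 15-digit labels collide?
anyDuplicated(lab17) > 0        # do the 17-digit labels collide?
```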

    PS> factor(1 + 0:5 * 1e-16, digitsLabels=17)

again, this does not make much sense; but why prevent the useR
from shooting themselves in the foot?
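
The values in that example differ only around the 16th or 17th digit, so
any label conversion that works with about 15 digits maps them to the
same string. In the sketch below, sprintf("%.15g", .) is used as a
stand-in for such a conversion; that is an assumption made here, since
as.character() may behave differently between R versions.

```r
x <- 1 + 0:5 * 1e-16

length(unique(x))               # number of distinct doubles among the six
unique(sprintf("%.15g", x))     # labels seen by a 15-digit conversion
unique(sprintf("%.17g", x))     # labels seen by a 17-digit conversion
```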

    PS> I would like to suggest a modification. It eliminates most of the cases, where
    PS> we get duplicated levels. It would eliminate all such cases, if the function
    PS> signif() works as expected. Unfortunately, if signif() works as it does in the
    PS> current versions of R, we still get duplicated levels.

    PS> The suggested modification is as follows.

    PS> If keepUnique is FALSE (the default), then
    PS> - values x are transformed by signif(x, digitsLabels)
    PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
    PS> - digitsLabels defaults to 15, but may be set to any integer value

I tend to like this change, given -- as you found yesterday -- that
as.character() does not even preserve 15 digits.
OTOH, as.character() has been in use for most of the history of
S (and R), whereas sprintf() is not backward compatible with
it and actually depends on the system's LIBC implementation of sprintf.
For that reason as.character() would be preferable.
Hmm....

    PS> If keepUnique is TRUE, then
    PS> - values x are preserved
    PS> - labels are computed using sprintf("%.*g", 17, levels)
    PS> - digitsLabels is ignored

I had originally planned to do exactly the above.
However, e.g.,  digitsLabels = 18  may be desired in some cases,
and that's why I also left the possibility to apply it in the
keepUnique case.

    PS> Arguments for the modification are the following.

    PS> 1. If keepUnique is FALSE, then computing labels using as.character() leads
    PS> to duplicated labels, as demonstrated in my previous email. So, I suggest
    PS> using sprintf("%.*g", digitsLabels, levels) instead of as.character().

{as said above, that seems sensible, though unfortunately quite
 a bit less backward compatible!}

    PS> 2. If keepUnique is TRUE and we allow digitsLabels less than 17, then we get
    PS> duplicated labels. So, I suggest forcing digitsLabels=17 if keepUnique=TRUE.

    PS> If signif(,digitsLabels) works as expected, then the above approach should not
    PS> produce duplicated labels. Unfortunately, this is not the case.
    PS> There are numbers, which remain different in signif(x, 16), but are mapped
    PS> to the same string in sprintf("%.*g", 16, x). Examples of this kind may be
    PS> found using the script

    PS> for (i in 1:50) {
    PS> x <- 10^runif(1, 38, 50)
    PS> y <- x * (1 + 0:500 * 1e-16)
    PS> y <- unique(signif(y, 16))
    PS> z <- unique(sprintf("%.16g", y))
    PS> stopifnot(length(y) == length(z))
    PS> }

    PS> This script is tested on Intel default arithmetic and on Intel with SSE.

    PS> Perhaps, digitsLabels = 16 could be forbidden, if keepUnique is FALSE.

    PS> Unfortunately, a similar problem occurs even for digitsLabels = 15, although for
    PS> much larger numbers.

    PS> for (i in 1:200) {
    PS> x <- 10^runif(1, 250, 300)
    PS> y <- x * (1 + 0:500 * 1e-16)
    PS> y <- unique(signif(y, 15))
    PS> z <- unique(sprintf("%.15g", y))
    PS> stopifnot(length(y) == length(z))
    PS> }

    PS> This script finds collisions, if SSE is enabled, on the two
    PS> Intel computers on which I ran the test. Without SSE, it
    PS> finds collisions on only one of them. Maybe it also depends
    PS> on the compiler, which is different.

probably rather on the exact implementation of the underlying C
library ("LIBC").
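
To compare platforms it may be more convenient to count the collisions
than to stop at the first one. The following is a sketch along the lines
of Petr's two scripts; count_collisions() and its arguments are
hypothetical names introduced here, and the counts it returns are
expected to depend on the platform and C library, so none are shown.

```r
# Hypothetical helper: count how many of 'n' random trials contain values
# that stay distinct under signif(., digits) but share a "%.<digits>g" label.
count_collisions <- function(digits, lo, hi, n = 50, seed = 1) {
  set.seed(seed)
  fmt <- paste0("%.", digits, "g")
  hits <- 0L
  for (i in seq_len(n)) {
    x <- 10^runif(1, lo, hi)
    y <- unique(signif(x * (1 + 0:500 * 1e-16), digits))
    z <- unique(sprintf(fmt, y))
    if (length(z) < length(y)) hits <- hits + 1L
  }
  hits
}

count_collisions(16, 38, 50)
count_collisions(15, 250, 300, n = 200)
```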

Thank you, Petr, for your investigations.
We all see that the simple requirement of  
   *no more duplicate factor levels !*
leads to considerable programming efforts for the case of
factor(<numeric>, .).

One prominent R-devel reader actually proposed to me in private,
that  factor(<numeric>, .)  should give a *warning* by default,
since he considered it unsafe practice.

Note that your last investigations show that your (two) proposed
changes actually do *not* solve the problem entirely;
further note that (at least inside the sources), we now say that
duplicate levels will not just signal a warning, but an error in
the future.
As long as we don't want to allow factor(<numeric>) to fail -- rarely --
I think (and that actually has been a recurring daunting thought
for quite a few days) that we probably need an
extra step of checking for duplicate levels, and if we find
some, recode "everything". This will blow up the body of the
factor() function even more.
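
One way such an extra step might look is sketched below. This is not
what the sources do: label_levels() is a hypothetical helper, and it
takes the simplest possible route of recoding through the labels
themselves, so that values whose labels coincide end up in the same
level.

```r
# Hypothetical helper, not part of R: build a factor from numeric 'x' in
# which levels that would receive identical labels are merged into one.
label_levels <- function(x, digits = 15L) {
  digits <- as.integer(digits)
  xs <- signif(x, digits)

  # labels of the sorted distinct values; unique() merges duplicates
  lab <- unique(sprintf("%.*g", digits, sort(unique(xs))))

  # recode every observation through its own label
  factor(sprintf("%.*g", digits, xs), levels = lab)
}

f <- label_levels(c(0.3, 0.1 + 0.2, 1 + 0:5 * 1e-16))
levels(f)
table(f)
```

The price of this approach is that every observation is formatted once
more, which is part of the extra cost mentioned above.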

What alternatives do you (all R-devel readers!) see?

Martin