[Rd] suggestion for extending ?as.factor

Michael Dewey info at aghmed.fsnet.co.uk
Sat May 9 15:54:40 CEST 2009


At 14:18 08/05/2009, Martin Maechler wrote:

> >>>>> "PS" == Petr Savicky <savicky at cs.cas.cz>
> >>>>>     on Fri, 8 May 2009 11:01:55 +0200 writes:

Somewhere below Martin asks for alternatives from list readers. I do 
not have alternatives, but I do have two comments, one immediately 
below this, the other embedded in-line.

This whole thread reminds me just why I have spent the best part of a 
decade climbing the virtual Matterhorn called 'Learning R' and why it 
is such a pleasure to use. It is the fact that somebody, somewhere 
cares enough about consistency, usability and accuracy to devote 
hours to getting even obscure details just right.


>     PS> On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote:
>     PD> I think that the real issue is that we actually do want almost-equal
>     PD> numbers to be folded together.
>     >>
>     >> yes, this now (revision 48469) will happen by default, using
>     >> signif(x, 15), where '15' is the default for the new optional
>     >> argument 'digitsLabels'
>     >> {better argument name? (but must not start with 'label')}
>
>     PS> Let me analyze the current behavior of factor(x) for numeric x
>     PS> with missing(levels) and missing(labels). In this situation,
>     PS> levels are computed as sort(unique(x)) from possibly transformed
>     PS> x. Then, labels are constructed by a conversion of the levels
>     PS> to strings.
>
>     PS> I understand the current (R 2.10.0, 2009-05-07 r48492) behavior
>     PS> as follows.
>
>     PS> If keepUnique is FALSE (the default), then
>     PS> - values x are transformed by signif(x, digitsLabels)
>     PS> - labels are computed using as.character(levels)
>     PS> - digitsLabels defaults to 15, but may be set to any integer value
>
>     PS> If keepUnique is TRUE, then
>     PS> - values x are preserved
>     PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
>     PS> - digitsLabels defaults to 17, but may be set to any integer value
>
>(in theory; in practice, I think I've suggested somewhere that
>  it should be  >= 17;  but see below.)
>
>Your summary seems correct to me.
>
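To make the summary above concrete, here is a minimal sketch of the two
labelling modes as I read them. The helper name sketch_levels() is my own
invention, and it only mimics the described behaviour; it is not the code
in the R sources, and the arguments keepUnique and digitsLabels are taken
from the thread, not from any released factor().

```r
# Hypothetical helper: mimics the levels/labels rules summarised above.
sketch_levels <- function(x, keepUnique = FALSE,
                          digitsLabels = if (keepUnique) 17L else 15L) {
  if (keepUnique) {
    # values are preserved; labels use a fixed number of significant digits
    levels <- sort(unique(x))
    labels <- sprintf("%.*g", as.integer(digitsLabels), levels)
  } else {
    # values are rounded first, so almost-equal numbers fold together
    levels <- sort(unique(signif(x, digitsLabels)))
    labels <- as.character(levels)
  }
  list(levels = levels, labels = labels)
}

x <- c(1, 1 + 1e-15, 2)
folded <- sketch_levels(x)                     # 1 and 1 + 1e-15 are merged
kept   <- sketch_levels(x, keepUnique = TRUE)  # all three values survive
```

The point of the sketch is only that the default folds near-equal values
into one level, while keepUnique = TRUE keeps them apart and therefore
needs enough digits in the labels to tell them apart.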
>     PS> There are several situations in which this approach produces
>     PS> duplicated levels. Besides the one described in my previous
>     PS> email, there are also others:
>     PS> factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15)
>
>yes, but this does not make much sense; I've already contemplated
>producing a warning in such cases, something like
>
>    if(keepUnique && digitsLabels < 17)
>      warning(gettextf(
>      "'digitsLabels = %d' is typically too small when 'keepUnique' is true",
>      digitsLabels))
>
>
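The reason 15 digits are too few when the values are kept unique can be
seen without the new factor() arguments at all. The following uses only
base R; the variable name x is mine.

```r
x <- c(0.3, 0.1 + 0.2)

same_value <- x[1] == x[2]        # the two doubles are not equal
labs15 <- sprintf("%.15g", x)     # 15 significant digits
labs17 <- sprintf("%.17g", x)     # 17 significant digits

# With 15 digits both values get the same label; with 17 they do not.
anyDuplicated(labs15) > 0
anyDuplicated(labs17) == 0
```

So two distinct levels would receive one and the same label at 15 digits,
which is exactly the duplicated-levels situation the warning is meant to
flag.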
>     PS> factor(1 + 0:5 * 1e-16, digitsLabels=17)
>
>again, this does not make much sense; but why disallow the useR
>to shoot himself in the foot?

I agree. As a useR I do not want to be stopped from doing anything. I 
would appreciate a warning just before I shoot myself in the foot and 
I definitely want one if it looks like I am going to aim for my head.
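For what it is worth, the second example, factor(1 + 0:5 * 1e-16,
digitsLabels=17), can be examined in plain base R. This is only a sketch
of what happens to the values, assuming ordinary IEEE 754 double
arithmetic; it does not call the experimental factor() arguments.

```r
x <- 1 + 0:5 * 1e-16

# The six inputs collapse onto a few representable doubles near 1,
# because the spacing of doubles just above 1 is about 2.2e-16.
n_distinct <- length(unique(x))

# Rounding to 15 significant digits folds all of them into a single value.
n_folded <- length(unique(signif(x, 15)))

# With 17 significant digits each distinct double gets its own label.
labs17 <- unique(sprintf("%.17g", x))
```

So the default rounding merges everything into one level, while asking
for 17 digits on rounded values is where the foot comes into the line of
fire.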

>     PS> I would like to suggest a modification. It eliminates most of
>     PS> the cases where we get duplicated levels. It would eliminate all
>     PS> such cases if the function signif() worked as expected.
>     PS> Unfortunately, with signif() working as it does in the current
>     PS> versions of R, we still get duplicated levels.
>
>     PS> The suggested modification is as follows.
>
>     PS> If keepUnique is FALSE (the default), then
>     PS> - values x are transformed by signif(x, digitsLabels)
>     PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
>     PS> - digitsLabels defaults to 15, but may be set to any integer value
>
>I tend to like this change, given -- as you found yesterday -- that
>as.character() is not even preserving 15 digits.
>OTOH,  as.character() has been in use for a very long time in
>S (and R), whereas using sprintf() is not backward compatible with
>it and actually depends on the LIBC implementation of the system sprintf.
>For that reason as.character() would be preferable.
>Hmm....
>
>     PS> If keepUnique is TRUE, then
>     PS> - values x are preserved
>     PS> - labels are computed using sprintf("%.*g", 17, levels)
>     PS> - digitsLabels is ignored
>
>I had originally planned to do exactly the above.
>However, e.g.,  digitsLabels = 18  may be desired in some cases,
>and that's why I also left the possibility to apply it in the
>keepUnique case.
>
>
>     PS> Arguments for the modification are the following.
>
>     PS> 1. If keepUnique is FALSE, then computing labels using
>     PS> as.character() leads to duplicated labels, as demonstrated in
>     PS> my previous email. So, I suggest using
>     PS> sprintf("%.*g", digitsLabels, levels) instead of as.character().
>
>{as said above, that seems sensible, though unfortunately quite
>  a bit less back-compatible!}
>
>     PS> 2. If keepUnique is TRUE and we allow digitsLabels less than 17,
>     PS> then we get duplicated labels. So, I suggest forcing
>     PS> digitsLabels=17 if keepUnique=TRUE.
>
>     PS> If signif(,digitsLabels) worked as expected, then the above
>     PS> approach should not produce duplicated labels. Unfortunately,
>     PS> this is not the case. There are numbers which remain different
>     PS> in signif(x, 16), but are mapped to the same string in
>     PS> sprintf("%.*g", 16, x). Examples of this kind may be found
>     PS> using the script
>
>     PS> for (i in 1:50) {
>     PS> x <- 10^runif(1, 38, 50)
>     PS> y <- x * (1 + 0:500 * 1e-16)
>     PS> y <- unique(signif(y, 16))
>     PS> z <- unique(sprintf("%.16g", y))
>     PS> stopifnot(length(y) == length(z))
>     PS> }
>
>     PS> This script was tested on Intel default arithmetic and on Intel
>     PS> with SSE.
>
>     PS> Perhaps digitsLabels = 16 could be forbidden if keepUnique is
>     PS> FALSE.
>
>     PS> Unfortunately, a similar problem occurs even for
>     PS> digitsLabels = 15, although for much larger numbers.
>
>     PS> for (i in 1:200) {
>     PS> x <- 10^runif(1, 250, 300)
>     PS> y <- x * (1 + 0:500 * 1e-16)
>     PS> y <- unique(signif(y, 15))
>     PS> z <- unique(sprintf("%.15g", y))
>     PS> stopifnot(length(y) == length(z))
>     PS> }
>
>     PS> This script finds collisions, if SSE is enabled, on two
>     PS> Intel computers where I did the test. Without SSE, it
>     PS> finds collisions only on one of them. Maybe it also depends
>     PS> on the compiler, which is different.
>
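For readers who want to try Petr's two scripts on their own machine
without the loop stopping at the first hit, here is a variant that counts
collisions instead of calling stopifnot(). The function name
count_collisions() and its arguments are my own; the body follows the two
scripts above. As the thread says, the result depends on the platform, so
I deliberately give no expected numbers.

```r
# Hypothetical wrapper around the two scripts above: counts the trials in
# which values that differ after signif() map to the same string.
count_collisions <- function(digits, lo, hi, trials = 50L) {
  collisions <- 0L
  for (i in seq_len(trials)) {
    x <- 10^runif(1, lo, hi)
    y <- unique(signif(x * (1 + 0:500 * 1e-16), digits))
    z <- unique(sprintf("%.*g", as.integer(digits), y))
    if (length(y) != length(z)) collisions <- collisions + 1L
  }
  collisions
}

set.seed(1)                                     # repeatable random draws
count_collisions(16, 38, 50)                    # first script
count_collisions(15, 250, 300, trials = 200L)   # second script
```

A count of zero on one machine says nothing about another, which is
rather the point of the exchange.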
>probably rather on the exact implementation of the underlying C
>library ("LIBC").
>
>Thank you, Petr, for your investigations.
>We all see that the simple requirement of
>    *no more duplicate factor levels !*
>leads to considerable programming efforts for the case of
>factor(<numeric>, .).
>
>One prominent R-devel reader actually proposed to me in private,
>that  factor(<numeric>, .)  should give a *warning* by default,
>since he considered it unsafe practice.
>
>Note that your last investigations show that your (two) proposed
>changes actually do *not* solve the problem entirely;
>further note that (at least inside the sources), we now say that
>duplicate levels will not just signal a warning, but an error in
>the future.
>As long as we don't want to allow  factor(<numeric>) to fail -- rarely --
>I think (and that actually has been a recurring daunting thought
>for quite a few days) that we probably need an
>extra step of checking for duplicate levels, and if we find
>some, recode "everything". This will blow up the body of the
>factor() function even more.
>
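One possible reading of the "extra step" is sketched below: build the
labels with the default number of digits, check for duplicates, and only
then redo all labels with more digits. This is my own guess at what such
a fallback could look like, not what is in the sources; the name
labels_with_fallback() and the upper limit of 17 digits are assumptions.

```r
# Hypothetical fallback: raise the number of significant digits until the
# labels of the (distinct) levels are themselves distinct.
labels_with_fallback <- function(levels, digits = 15L, max_digits = 17L) {
  digits <- as.integer(digits)
  max_digits <- max(as.integer(max_digits), digits)
  for (d in digits:max_digits) {
    labels <- sprintf("%.*g", d, levels)
    if (!anyDuplicated(labels)) return(labels)   # unique: done
  }
  stop("could not construct unique labels with ", max_digits, " digits")
}

lev <- sort(unique(c(0.3, 0.1 + 0.2, 0.5)))
labs <- labels_with_fallback(lev)   # all labels recoded with more digits
```

The price is that one awkward pair of values changes the labels of every
level, which may well be what "recode everything" has to mean.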
>What alternatives do you (all R-devel readers!) see?
>
>Martin
>
>______________________________________________
>R-devel at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-devel

Michael Dewey
http://www.aghmed.fsnet.co.uk


