[Rd] suggestion for extending ?as.factor
Martin Maechler
maechler at stat.math.ethz.ch
Fri May 8 15:18:01 CEST 2009
>>>>> "PS" == Petr Savicky <savicky at cs.cas.cz>
>>>>> on Fri, 8 May 2009 11:01:55 +0200 writes:
PS> On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote:
PD> I think that the real issue is that we actually do want almost-equal
PD> numbers to be folded together.
>>
>> yes, this now (revision 48469) will happen by default, using signif(x, 15)
>> where '15' is the default for the new optional argument 'digitsLabels'
>> {better argument name? (but must not start with 'label')}
PS> Let me analyze the current behavior of factor(x) for numeric x with missing(levels)
PS> and missing(labels). In this situation, levels are computed as sort(unique(x))
PS> from possibly transformed x. Then, labels are constructed by a conversion of the
PS> levels to strings.
PS> I understand the current (R 2.10.0, 2009-05-07 r48492) behavior as follows.
PS> If keepUnique is FALSE (the default), then
PS> - values x are transformed by signif(x, digitsLabels)
PS> - labels are computed using as.character(levels)
PS> - digitsLabels defaults to 15, but may be set to any integer value
PS> If keepUnique is TRUE, then
PS> - values x are preserved
PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
PS> - digitsLabels defaults to 17, but may be set to any integer value
(in theory; in practice, I think I've suggested somewhere that
it should be >= 17; but see below.)
Your summary seems correct to me.
PS> There are several situations in which this approach produces duplicated levels.
PS> Besides the one described in my previous email, there are also others:
PS> factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15)
yes, but this does not make much sense; I've already contemplated
producing a warning in such cases, something like
    if(keepUnique && digitsLabels < 17)
        warning(gettextf(
            "'digitsLabels = %d' is typically too small when 'keepUnique' is true",
            digitsLabels))
PS> factor(1 + 0:5 * 1e-16, digitsLabels=17)
again, this does not make much sense; but why disallow the useR
to shoot himself in the foot?
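For reference, both examples can be reproduced with base R alone, without the
new arguments. This is only a sketch: it assumes that the labels come from
sprintf() or as.character() applied directly to the values, and it does not
call the development version of factor().

```r
# Two doubles that differ only in the last bits
x <- c(0.3, 0.1 + 0.2)
x[1] == x[2]                 # FALSE: the values are distinct

# 15 significant digits cannot tell them apart, 17 can
lab15 <- sprintf("%.15g", x)
lab17 <- sprintf("%.17g", x)
anyDuplicated(lab15) > 0     # TRUE:  both labels are "0.3"
anyDuplicated(lab17) > 0     # FALSE: the labels differ

# Second example: several distinct values just above 1 ...
y <- unique(1 + 0:5 * 1e-16)
length(y) > 1                # TRUE
# ... which as.character() maps to a single string
ylab <- unique(as.character(y))
```

In both cases the number of distinct values exceeds the number of distinct
labels, which is exactly the situation that yields duplicated levels.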
PS> I would like to suggest a modification. It eliminates most of the cases, where
PS> we get duplicated levels. It would eliminate all such cases, if the function
PS> signif() works as expected. Unfortunately, if signif() works as it does in the
PS> current versions of R, we still get duplicated levels.
PS> The suggested modification is as follows.
PS> If keepUnique is FALSE (the default), then
PS> - values x are transformed by signif(x, digitsLabels)
PS> - labels are computed using sprintf("%.*g", digitsLabels, levels)
PS> - digitsLabels defaults to 15, but may be set to any integer value
I tend to like this change, given -- as you found yesterday -- that
as.character() is not even preserving 15 digits.
OTOH, as.character() has been in use for a very long history of
S (and R), whereas using sprintf() is not back compatible with
it and actually depends on the LIBC implementation of the system-sprintf.
For that reason as.character() would be preferable.
Hmm....
PS> If keepUnique is TRUE, then
PS> - values x are preserved
PS> - labels are computed using sprintf("%.*g", 17, levels)
PS> - digitsLabels is ignored
I had originally planned to do exactly the above.
However, e.g., digitsLabels = 18 may be desired in some cases,
and that's why I also left the possibility to apply it in the
keepUnique case.
PS> Arguments for the modification are the following.
PS> 1. If keepUnique is FALSE, then computing labels using as.character() leads
PS> to duplicated labels as demonstrated in my previous email. So, I suggest
PS> using sprintf("%.*g", digitsLabels, levels) instead of as.character().
{as said above, that seems sensible, though unfortunately quite
a bit less back-compatible!}
PS> 2. If keepUnique is TRUE and we allow digitsLabels less than 17, then we get
PS> duplicated labels. So, I suggest forcing digitsLabels=17 if keepUnique=TRUE.
PS> If signif(,digitsLabels) works as expected, then the above approach should not
PS> produce duplicated labels. Unfortunately, this is not the case.
PS> There are numbers, which remain different in signif(x, 16), but are mapped
PS> to the same string in sprintf("%.*g", 16, x). Examples of this kind may be
PS> found using the script
PS> for (i in 1:50) {
PS>     x <- 10^runif(1, 38, 50)
PS>     y <- x * (1 + 0:500 * 1e-16)
PS>     y <- unique(signif(y, 16))
PS>     z <- unique(sprintf("%.16g", y))
PS>     stopifnot(length(y) == length(z))
PS> }
PS> This script is tested on Intel default arithmetic and on Intel with SSE.
PS> Perhaps, digitsLabels = 16 could be forbidden, if keepUnique is FALSE.
PS> Unfortunately, a similar problem occurs even for digitsLabels = 15, although for
PS> much larger numbers.
PS> for (i in 1:200) {
PS>     x <- 10^runif(1, 250, 300)
PS>     y <- x * (1 + 0:500 * 1e-16)
PS>     y <- unique(signif(y, 15))
PS>     z <- unique(sprintf("%.15g", y))
PS>     stopifnot(length(y) == length(z))
PS> }
PS> This script finds collisions, if SSE is enabled, on two
PS> Intel computers where I did the test. Without SSE, it
PS> finds collisions only on one of them. Maybe it also depends
PS> on the compiler, which is different.
probably rather on the exact implementation of the underlying C
library ("LIBC").
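To see which values collide on a given platform, rather than only stopping at
the first failure, something like the following could be used. It is a rough
sketch; label_collisions() is a hypothetical helper name, not an existing R
function.

```r
# Hypothetical helper: group distinct values whose "%.<digits>g" labels coincide
label_collisions <- function(values, digits) {
  values <- unique(values)
  labels <- sprintf("%.*g", digits, values)
  groups <- split(values, labels)
  # keep only labels shared by more than one distinct value
  groups[lengths(groups) > 1L]
}

coll15 <- label_collisions(c(0.3, 0.1 + 0.2), 15)  # one group, named "0.3"
coll17 <- label_collisions(c(0.3, 0.1 + 0.2), 17)  # empty list
```

Inside the scripts above, one would call label_collisions(signif(y, 16), 16)
(or the 15-digit analogue) in place of the stopifnot() line; whether anything
is reported will depend on the platform, as noted.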
Thank you, Petr, for your investigations.
We all see that the simple requirement of
*no more duplicate factor levels !*
leads to considerable programming efforts for the case of
factor(<numeric>, .).
One prominent R-devel reader actually proposed to me in private,
that factor(<numeric>, .) should give a *warning* by default,
since he considered it unsafe practice.
Note that your last investigations show that your (two) proposed
changes actually do *not* solve the problem entirely;
further note that (at least inside the sources), we now say that
duplicate levels will not just signal a warning, but an error in
the future.
As long as we don't want to allow factor(<numeric>) to fail -- rarely --
I think (and that actually has been a recurring daunting thought
for quite a few days) that we probably need an
extra step of checking for duplicate levels, and if we find
some, recode "everything". This will blow up the body of the
factor() function even more.
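To make the idea concrete, here is a minimal sketch of such an extra step. It
is not the code in the R sources: safe_numeric_factor() and its 'digits'
argument are made up for illustration, and it assumes that labels are built
with sprintf("%.*g", ...).

```r
# Hypothetical sketch: build a factor from numeric x, folding together
# all values that end up with the same label
safe_numeric_factor <- function(x, digits = 15) {
  lev <- sort(unique(x))                   # candidate levels, NA dropped
  lab <- sprintf("%.*g", digits, lev)      # one label per candidate level
  ulab <- unique(lab)                      # duplicated labels collapse here
  # recode "everything": match each value through its label, not its value
  codes <- match(sprintf("%.*g", digits, x), ulab)
  factor(codes, levels = seq_along(ulab), labels = ulab)
}

f15 <- safe_numeric_factor(c(0.3, 0.1 + 0.2, 0.5), digits = 15)
f17 <- safe_numeric_factor(c(0.3, 0.1 + 0.2, 0.5), digits = 17)
levels(f15)   # "0.3" "0.5"
nlevels(f17)  # 3
```

The sketch always recodes through the labels, so it never produces duplicated
levels, at the price of formatting every element of x instead of only the
unique values.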
What alternatives do you (all R-devel readers!) see?
Martin