[Rd] suggestion for extending ?as.factor
Martin Maechler
maechler at stat.math.ethz.ch
Mon May 11 17:06:38 CEST 2009
>>>>> "PS" == Petr Savicky <savicky at cs.cas.cz>
>>>>> on Sun, 10 May 2009 13:52:53 +0200 writes:
PS> On Sat, May 09, 2009 at 10:55:17PM +0200, Martin Maechler wrote:
PS> [...]
>> If we'd revert to such a solution,
>> we'd have to get back to Peter's point about the issue that
>> he'd think table(.) should be more tolerant than as.character()
>> about "almost equality".
>> For compatibility reasons, we could also go back to the
>> reasoning that useR should use {something like}
>> table(signif(x, 14))
>> instead of
>> table(x)
>> for numeric x in "typical" cases.
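A minimal sketch of the rounding workaround quoted above, using the same four-value example that appears later in this message. What table(x) itself returns depends on how the installed R version's as.character() formats doubles, so only the rounded call is checked here.

```r
# Four doubles that differ from 0.3 only in the last few bits.
x <- 0.3 + 2e-16 * c(-2, -1, 1, 2)

# They are all distinct as doubles ...
n_distinct <- length(unique(x))

# ... but rounding to 14 significant digits collapses them to one value,
# so table() sees a single group.
tab <- table(signif(x, 14))

print(n_distinct)
print(tab)
```

The choice of 14 digits is the one suggested in the quoted text; it leaves roughly one decimal digit of slack below the 15 to 16 digits a double can carry.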
PS> In the released versions 2.8.1 and 2.9.0, function factor() satisfies
PS> identical(as.character(factor(x)), as.character(x)) (*)
PS> for all numeric x. This follows from the code (levels are computed by
PS> as.character() from unmodified input values) and may be verified
PS> even for the problematic cases, for example
PS> x <- (0.3 + 2e-16 * c(-2,-1,1,2))
PS> factor(x)
PS> # [1] 0.300000000000000 0.3 0.3 0.300000000000000
PS> # Levels: 0.300000000000000 0.3 0.3 0.300000000000000
PS> as.character(x)
PS> # [1] "0.300000000000000" "0.3" "0.3"
PS> # [4] "0.300000000000000"
PS> identical(as.character(factor(x)), as.character(x))
PS> # [1] TRUE
PS> In my opinion, it is reasonable to require that (*) be
PS> preserved also in future versions of R.
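Property (*) can be checked directly on any R installation. The sketch below wraps it in a small helper; the name holds_star is hypothetical and is not part of R.

```r
# Hypothetical helper: TRUE if property (*) holds for the numeric vector x.
holds_star <- function(x) {
  identical(as.character(factor(x)), as.character(x))
}

# A few values, including ones that differ from 0.3 only by rounding error.
x <- c(0.1 + 0.2, 1 / 3, 1e5, 0.3 + 2e-16 * c(-2, -1, 1, 2))
print(holds_star(x))
```

This only tests the values supplied; it is a spot check, not a proof that (*) holds for all numeric x.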
PS> Function as.character(x) has disadvantages. Besides
PS> the platform dependence, it also does not always perform
PS> the rounding needed to eliminate FP errors. Usually,
PS> as.character(x) rounds to at most 15 digits, so we get,
PS> for example
PS> as.character(0.1 + 0.2) # [1] "0.3"
PS> as required. However, there are also exceptions, for example
PS> as.character(1e19 + 1e5) # [1] "10000000000000100352"
PS> Here, the number is printed exactly, so the resulting
PS> string contains the FP error caused by the fact that
PS> 1e19 + 1e5 needs more than the 53 significant bits of a
PS> double in its binary representation, namely 59.
PS> binary representation of 1e19 + 1e5 is
PS> 1000101011000111001000110000010010001001111010011000011010100000
PS> binary representation of 10000000000000100352 is
PS> 1000101011000111001000110000010010001001111010011000100000000000
PS> However, as.character(x) seems to do enough rounding for
PS> most purposes, otherwise it would not be suitable as the
PS> basic numeric to character conversion. If table() needs
PS> factor() with a different conversion than
PS> as.character(x), it may be done explicitly as discussed
PS> by Martin above.
PS> So, I suggest using as.character() as the default
PS> conversion in factor(), so that
PS> identical(as.character(factor(x)), as.character(x)) is
PS> satisfied for the default usage of factor().
PS> Of course, I would appreciate it if factor() had parameters
PS> that allow better control of the underlying conversion,
PS> as is done in the current development versions.
The version I have committed a few hours ago is indeed a much
re-simplified version, using as.character(.) explicitly
and consequently no longer providing the extra optional
arguments that we have had for a couple of days.
Keeping such a basic function factor() as simple as possible
seems a good strategy to me.
Martin Maechler