[Rd] suggestion for extending ?as.factor
Peter Dalgaard
P.Dalgaard at biostat.ku.dk
Mon May 4 15:34:09 CEST 2009
Martin Maechler wrote:
>>>>>> "PS" == Petr Savicky <savicky at cs.cas.cz>
>>>>>> on Sun, 3 May 2009 22:32:04 +0200 writes:
>
> PS> In R-2.10.0, the development version, function as.factor() uses 17 digit
> PS> precision for conversion of numeric values to character type. This
> PS> is very good for the consistency of the resulting factor, however,
> PS> i expect that people will complain about, for example, as.factor(0.3)
> PS> being
> PS> [1] 0.29999999999999999
> PS> Levels: 0.29999999999999999
>
> PS> I suggest to extend the "Warning" section of ?as.factor by the following
> PS> paragraph.
>
> PS> If as.factor() is used for a numeric vector, then the numbers are
> PS> converted to character strings with 17 digit precision using their
> PS> machine representation. This guarantees that different numbers are
> PS> converted to different levels, but may produce unwanted results, if
> PS> the numbers are expected to have limited number of decimal positions.
> PS> For example, as.factor(c(0.1, 0.2, 0.3)) produces
> PS> [1] 0.10000000000000001 0.20000000000000001 0.29999999999999999
> PS> Levels: 0.10000000000000001 0.20000000000000001 0.29999999999999999
> PS> In order to avoid this, convert the numbers to a character vector
> PS> using formatC() or a similar function before using as.factor().
>
> PS> Petr.
>
> Thank you, Petr, for the good suggestion.
>
> I have added a (shorter) paragraph, though to the 'Details' not the
> 'Warning' section, and also one to the 'Examples' :
>
> ## Converting (non-integer) numbers:
> as.factor(c(0.1, 0.2, 0.3)) # maybe not what you'd expect, so rather use
> factor(format(c(0.1, 0.2, 0.3)))

Martin,

I tend to consider this a bug, plain and simple. We might as well have
abolished conversion of numerics to factor altogether. (Notice, BTW,
that conversion to mode "character" changes the sort order, so format()
is not a universal fix. IIRC, we did consider the 1 10 2 3 4 5 6 7 8 9
issue when designing R's version of factor().)
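
A minimal sketch of the sort-order point, using as.character() as a stand-in for any conversion to character (the variable names are only for illustration):

```r
x <- 1:10

# Levels derived from the numbers themselves follow numeric order.
num_levels <- levels(factor(x))

# Levels derived from strings follow character order.
chr_levels <- levels(factor(as.character(x)))

print(num_levels)
print(chr_levels)  # "10" sorts directly after "1", ahead of "2"
```
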

The current R-devel behaviour is silly and we should just get rid of it
before a final release. It should be the other way around: If people
rely on whether numerical factor levels differ at 17-digit precision,
THEN they should use format with suitable arguments.
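
As a sketch of that opt-in route, assuming formatC() with digits = 17 is an acceptable way to request full precision (one option among several, and the variable names are illustrative):

```r
# Two doubles that differ only in their last bits.
x <- c(0.3, 0.1 + 0.2)

# Explicitly request 17 significant digits before building the factor.
full <- formatC(x, digits = 17, format = "g")
f17 <- factor(full)

print(levels(f17))
```

Here the caller asks for the extra digits explicitly, so the two values end up as separate levels by choice rather than by default.
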

If we have issues with numeric values that are very slightly different
but round to get the same level name, how about putting something like

    if (is.numeric(x)) x <- zapsmall(x)

somewhere at the start of the factor() function?
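
A rough sketch of what that could look like, written as a wrapper rather than a change to factor() itself. The name factor_zapped is hypothetical and not part of R:

```r
# Hypothetical wrapper, not part of R: applies the proposed zapsmall()
# step to numeric input and then hands over to the ordinary factor().
factor_zapped <- function(x, ...) {
  if (is.numeric(x)) x <- zapsmall(x)
  factor(x, ...)
}

# Two doubles that differ only in their last bits.
x <- c(0.3, 0.1 + 0.2)
print(x[1] == x[2])

f <- factor_zapped(x)
print(nlevels(f))
```

One caveat with this sketch: zapsmall() rounds relative to the largest absolute value in x, so a vector that mixes very large and very small values may have the small ones rounded away entirely.
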
-p
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907