# [Rd] suggestion for extending ?as.factor

Petr Savicky savicky at cs.cas.cz
Tue May 5 09:48:20 CEST 2009

```
On Mon, May 04, 2009 at 07:28:06PM +0200, Peter Dalgaard wrote:
> Petr Savicky wrote:
> > For this, we get
> >
> >   > convert(0.3)
> >    "0.3"
> >   > convert(1/3)
> >    "0.3333333333333333" # 16 digits suffice
> >   > convert(0.12345)
> >    "0.12345"
> >   > convert(0.12345678901234567)
> >    "0.12345678901234566"
> >   > 0.12345678901234567 == as.numeric("0.12345678901234566")
> >    TRUE
> >
> > This algorithm is slower than a single call to sprintf("%.17g", x), but it
> > produces nicer numbers, if possible, and guarantees that the value is
> > always preserved.
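
A minimal sketch of such a conversion (my illustration, not the exact code
under discussion): try increasing numbers of significant digits and stop as
soon as the decimal string converts back to exactly the original double.

convert_sketch <- function(x) {
  for (d in 1:17) {
    s <- sprintf("%.*g", d, x)          # d significant digits
    if (as.numeric(s) == x) return(s)   # stop once the value round-trips
  }
  s
}
convert_sketch(0.3)   # "0.3"
convert_sketch(1/3)   # "0.3333333333333333"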
>
> Yes, but....
>
> > convert(0.2+0.1)
>  "0.30000000000000004"

I am not sure whether this should be considered an error. Computing with decimal
numbers requires some care. If we want the result closest to 0.3, it is better
to use round(0.1+0.2, digits=1).
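
For illustration:

x <- 0.1 + 0.2
x == 0.3                     # FALSE, x is 0.30000000000000004
round(x, digits=1) == 0.3    # TRUE, rounding recovers the closest value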

> I think that the real issue is that we actually do want almost-equal
> numbers to be folded together.

Yes, this is frequently needed. A related question, approximate matching in a
numeric sequence, was discussed in the R-devel thread "Match .3 in a sequence"
from March. In order to fold almost-equal numbers together, we need to specify
a tolerance. The tolerance depends on the application, and in my opinion it is
hard to choose one that would make a good default.
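
For example, one possible way to fold with an explicit tolerance is to snap
each value to a grid of spacing tol (the helper name fold_close is mine, just
for illustration):

fold_close <- function(x, tol) round(x / tol) * tol
y <- c(0.3, 0.1 + 0.2, 0.30000001)
length(unique(y))                     # 3 distinct doubles
length(unique(fold_close(y, 1e-6)))   # 1 after folding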

> The most relevant case I can conjure up
> is this (permutation testing):
>
> > zz <- replicate(20000,sum(sample(sleep$extra,10)))
> > length(table(zz))
>  427
> > length(table(signif(zz,7)))
>  281

In order to obtain the correct result in this example, it is possible to use
  zz <- signif(zz, 7)
as you suggest, or
  zz <- round(zz, digits=1)
and then use the resulting zz for all further calculations.
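
The effect can be seen even without sampling: the same numbers summed in a
different order may give a different double, and signif() folds the two
results back together.

a <- 0.1 + 0.2 + 0.3           # 0.6000000000000001
b <- 0.3 + 0.2 + 0.1           # 0.6
a == b                         # FALSE
signif(a, 7) == signif(b, 7)   # TRUE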

> Notice that the discrepancy comes from sums that really are identical
> values (in decimal arithmetic), but where the binary FP inaccuracy makes
> them slightly different.
>
> [for a nice picture, continue the example with
>
> > tt <- table(signif(zz,7))
> > plot(as.numeric(names(tt)),tt, type="h")

The form of this picture is not due to rounding errors. The picture may be
reproduced even in integer arithmetic as follows.

ss <- round(10*sleep$extra)
zz <- replicate(20000,sum(sample(ss,10)))
tt <- table(zz)
plot(as.numeric(names(tt)),tt, type="h")

The variation of the frequencies is due to two effects.

First, each individual value of the sum occurs with low probability, so 20000
replications do not suffice to get a low variance of the individual frequencies.
Using 1000 repetitions of the code above, I obtained estimates of some of the
probabilities. The most frequent sums have probability approximately p=0.0089
for a single sample. With n=20000 replications, we get the mean frequency
p*n = 178 and standard deviation sqrt(p*(1-p)*n) = 13.28216.
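
These numbers can be checked directly:

p <- 0.0089
n <- 20000
p * n                    # 178
sqrt(p * (1 - p) * n)    # 13.28216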

The other cause of variation of the frequencies is that even the true distribution of
the sums has a lot of local minima and maxima. The mean of 1000 repetitions of the above
table restricted to values of the sum in the interval 140:168 produced the estimates

value mean frequency (over 1000 tables)
140   172.411
141   172.090
142   174.297
143   166.039
144   159.260
145   163.891
146   162.317
147   165.460
148   177.870
149   177.971
150   177.754
151   178.525 local maximum
152   169.851
153   164.443 local minimum
154   168.488 the mean value of the sum
155   164.816 local minimum
156   169.297
157   179.248 local maximum
158   177.799
159   176.743
160   177.777
161   164.173
162   162.585
163   164.641
164   159.913
165   165.932
166   173.014
167   172.276
168   171.612

The local minima and maxima are visible here. The mean value 154 is approximately
at the center of the histogram.

Petr.

```