[Rd] suggestion for extending ?as.factor
Petr Savicky
savicky at cs.cas.cz
Tue May 5 09:48:20 CEST 2009
On Mon, May 04, 2009 at 07:28:06PM +0200, Peter Dalgaard wrote:
> Petr Savicky wrote:
> > For this, we get
> >
> > > convert(0.3)
> > [1] "0.3"
> > > convert(1/3)
> > [1] "0.3333333333333333" # 16 digits suffice
> > > convert(0.12345)
> > [1] "0.12345"
> > > convert(0.12345678901234567)
> > [1] "0.12345678901234566"
> > > 0.12345678901234567 == as.numeric("0.12345678901234566")
> > [1] TRUE
> >
> > This algorithm is slower than a single call to sprintf("%.17g", x), but it
> > produces nicer numbers, if possible, and guarantees that the value is
> > always preserved.
>
> Yes, but....
>
> > convert(0.2+0.1)
> [1] "0.30000000000000004"
I am not sure whether this should be considered an error. Computing with decimal
numbers requires some care. If we want to get the result, which is the closest
to 0.3, it is better to use round(0.1+0.2, digits=1).
> I think that the real issue is that we actually do want almost-equal
> numbers to be folded together.
Yes, this is frequently needed. A related question of an approximate match
in a numeric sequence was discussed in R-devel thread "Match .3 in a sequence"
from March. In order to fold almost-equal numbers together, we need to specify
a tolerance to use. The tolerance depends on application. In my opinion, it is
hard to choose a tolerance, which could be a good default.
> The most relevant case I can conjure up
> is this (permutation testing):
>
> > zz <- replicate(20000,sum(sample(sleep$extra,10)))
> > length(table(zz))
> [1] 427
> > length(table(signif(zz,7)))
> [1] 281
In order to obtain the correct result in this example, it is possible to use
zz <- signif(zz,7)
as you suggest or
zz <- round(zz, digits=1)
and use the resulting zz for all further calculations.
> Notice that the discrepancy comes from sums that really are identical
> values (in decimal arithmetic), but where the binary FP inaccuracy makes
> them slightly different.
>
> [for a nice picture, continue the example with
>
> > tt <- table(signif(zz,7))
> > plot(as.numeric(names(tt)),tt, type="h")
The form of this picture is not due to rounding errors. The picture may be
obtained even within an integer arithmetic as follows.
ss <- round(10*sleep$extra)
zz <- replicate(20000,sum(sample(ss,10)))
tt <- table(zz)
plot(as.numeric(names(tt)),tt, type="h")
The variation of the frequencies is due to two effects.
First, each individual value of the sum occurs with low probability, so 20000
replications do not suffice to get low variance of individual frequencies. Using
1000 repetitions of the code above, i obtained estimate of some of the probabilities.
The most frequent sums have probability approximately p=0.0089 for a single sample.
With n=20000 replications, we get the mean frequency p*n = 178 and standard deviation
sqrt(p*(1-p)*n) = 13.28216.
The other cause of variation of the frequencies is that even the true distribution of
the sums has a lot of local minima and maxima. The mean of 1000 repetitions of the above
table restricted to values of the sum in the interval 140:168 produced the estimates
value mean frequency (over 1000 tables)
140 172.411
141 172.090
142 174.297
143 166.039
144 159.260
145 163.891
146 162.317
147 165.460
148 177.870
149 177.971
150 177.754
151 178.525 local maximum
152 169.851
153 164.443 local minimum
154 168.488 the mean value of the sum
155 164.816 local minimum
156 169.297
157 179.248 local maximum
158 177.799
159 176.743
160 177.777
161 164.173
162 162.585
163 164.641
164 159.913
165 165.932
166 173.014
167 172.276
168 171.612
The local minima and maxima are visible here. The mean value 154 is approximately
the center of the histogram.
Petr.
More information about the R-devel
mailing list