[Rd] suggestion for extending ?as.factor
Peter Dalgaard
P.Dalgaard at biostat.ku.dk
Tue May 5 11:27:36 CEST 2009
Petr Savicky wrote:
>
>> Notice that the discrepancy comes from sums that really are identical
>> values (in decimal arithmetic), but where the binary FP inaccuracy makes
>> them slightly different.
>>
>> [for a nice picture, continue the example with
>>
>>> tt <- table(signif(zz,7))
>>> plot(as.numeric(names(tt)),tt, type="h")
>
> The form of this picture is not due to rounding errors. The picture may be
> obtained even within an integer arithmetic as follows.
>
> ss <- round(10*sleep$extra)
> zz <- replicate(20000,sum(sample(ss,10)))
> tt <- table(zz)
> plot(as.numeric(names(tt)),tt, type="h")
I know. The point was rather that if you are not careful with rounding,
you get some of the bars wrong (2 or 3 small bars very close to each
other instead of one longer one). Computed p values from permutation
tests (as in mean(sim>=obs)) also need care for the same reason.
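
A minimal sketch of the p value issue, assuming the built-in sleep data and
taking the sum of the second group as the observed statistic. The names x,
obs, sim, p_raw and p_rounded, the seed, and the choice of 7 significant
digits are illustrative assumptions, not anything fixed by the discussion.
The two p values can differ when a simulated sum equals the observed one in
decimal arithmetic but lands just below it in binary floating point.

```r
# Sketch only: 'obs' and 'sim' follow the mean(sim>=obs) idiom above;
# the seed and the 7 significant digits are arbitrary choices.
set.seed(1)
x   <- sleep$extra                      # 20 values recorded to one decimal
obs <- sum(x[11:20])                    # observed statistic: sum of group 2
sim <- replicate(20000, sum(sample(x, 10)))

p_raw     <- mean(sim >= obs)                        # ties exposed to FP noise
p_rounded <- mean(signif(sim, 7) >= signif(obs, 7))  # ties compared after rounding

c(raw = p_raw, rounded = p_rounded)
```

Rounding both sides before comparing means that sums which are ties in
decimal arithmetic are counted as ties, so the rounded p value can only be
equal to or larger than the raw one.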
>
> The variation of the frequencies is due to two effects.
>
> First, each individual value of the sum occurs with low probability, so 20000
....
>
> The other cause of variation of the frequencies is that even the true distribution of
> the sums has a lot of local minima and maxima.
Yes. You can actually generate the exact distribution easily using
d <- combn(sleep$extra, 10, sum)
d <- signif(d,7)
tt <- table(d)
plot(as.numeric(names(tt)),tt, type="h")
and if you omit the signif() bit (not with R-devel):
> table(table(names(table(d))))
  1   2   3
137 161  17
i.e. 315 distinct values but over half occur in duplicate or triplicate
versions.
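
One way to sidestep the rounding step altogether, combining the integer
trick from the quoted message with combn(), might look like the sketch
below. It assumes the data are recorded to one decimal place, so that
10*extra is a whole number; d_int and d_raw are hypothetical names used
only for this illustration.

```r
# Sketch: exact distribution via integer arithmetic (tenths), so that
# equal sums are exactly equal and no signif() step is needed.
ss    <- round(10 * sleep$extra)        # data in tenths, exact integers
d_int <- combn(ss, 10, sum)             # all choose(20, 10) subset sums
tt    <- table(d_int)
plot(as.numeric(names(tt)) / 10, tt, type = "h")

# For comparison: the same sums in floating point, unrounded.
d_raw <- combn(sleep$extra, 10, sum)
c(integer = length(unique(d_int)), floating = length(unique(d_raw)))
```

The second count cannot be smaller than the first, and any excess is the
number of spurious extra values introduced by binary FP inaccuracy.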
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907