[Rd] suggestion for extending ?as.factor

Tue May 5 11:27:36 CEST 2009

Petr Savicky wrote:

> 
>> Notice that the discrepancy comes from sums that really are identical
>> values (in decimal arithmetic), but where the binary FP inaccuracy makes
>> them slightly different.
>>
>> [for a nice picture, continue the example with
>>
>>> tt <- table(signif(zz,7))
>>> plot(as.numeric(names(tt)),tt, type="h")
> 
> The form of this picture is not due to rounding errors. The picture may be
> obtained even within an integer arithmetic as follows.
> 
>   ss <- round(10*sleep$extra)
>   zz <- replicate(20000,sum(sample(ss,10)))
>   tt <- table(zz)
>   plot(as.numeric(names(tt)),tt, type="h")

I know. The point was rather that if you are not careful with rounding,
you get some of the bars wrong (you get 2 or 3 small bars very close
to each other instead of one longer one). Computed p values from
permutation tests (as in mean(sim>=obs)) also need care for the same reason.
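
A minimal sketch of the p-value point, assuming a two-group split of
sleep$extra with the second group's sum as the observed statistic. The
names x, obs, sim, p_naive and p_round are illustrative only, and 7
significant digits is simply the value used in the plots above.

```r
# Sketch only: the setup and all variable names are assumptions for illustration.
x <- sleep$extra
obs <- sum(x[11:20])                        # observed sum for the second group

set.seed(1)                                 # make the simulation reproducible
sim <- replicate(2000, sum(sample(x, 10)))  # sums of random subsets of size 10

# Naive comparison: a simulated sum that equals obs in decimal arithmetic
# can land slightly below it in binary FP and then not be counted.
p_naive <- mean(sim >= obs)

# Round both sides to the same number of significant digits first, so that
# sums that are equal in decimal arithmetic also compare as equal.
p_round <- mean(signif(sim, 7) >= signif(obs, 7))
```

Since rounding is applied to both sides, ties can only be gained, never
lost, so p_round should not come out smaller than p_naive.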

> 
> The variation of the frequencies is due to two effects.
> 
> First, each individual value of the sum occurs with low probability, so 20000
....

> 
> The other cause of variation of the frequencies is that even the true distribution of
> the sums has a lot of local minima and maxima. 

Yes. You can actually generate the exact distribution easily using

d <- combn(sleep$extra, 10, sum)
d <- signif(d,7)
tt <- table(d)
plot(as.numeric(names(tt)),tt, type="h")

and if you omit the signif() bit (not with R-devel):

> table(table(names(table(d))))

  1   2   3
137 161  17

i.e. 315 distinct values, but over half of them occur in duplicate or
triplicate versions.
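
A version-independent way to see the same effect, as a sketch: count
the distinct sums with and without rounding. unique() compares doubles
exactly, so this does not depend on how table() builds its labels. The
names n_exact and n_round are illustrative, and the counts may differ
by platform, so none are quoted here.

```r
# Sketch only: variable names are assumptions for illustration.
d <- combn(sleep$extra, 10, sum)          # all sums of 10 of the 20 values

n_exact <- length(unique(d))              # distinct values as binary doubles
n_round <- length(unique(signif(d, 7)))   # distinct values after rounding

# Any excess of n_exact over n_round comes from sums that are equal in
# decimal arithmetic but differ in their last bits.
n_exact - n_round
```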

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907