[Rd] Are r2dtable and C_r2dtable behaving correctly?
Peter Dalgaard
pdalgd at gmail.com
Fri Aug 25 12:04:58 CEST 2017
> On 25 Aug 2017, at 11:23 , Jari Oksanen <jari.oksanen at oulu.fi> wrote:
>
> It is not about "really large total number of observations", but:
>
> set.seed(4711)
> tabs <- r2dtable(1e6, c(2, 2), c(2, 2))
> A11 <- vapply(tabs, function(x) x[1, 1], numeric(1))
> table(A11)
>
> A11
> 0 1 2
> 166483 666853 166664
>
> There are three possible matrices, and these come out in proportions 1:4:1, the one with all cells filled with ones being
> most common.
... and
> dhyper(0:2,2,2,2)
[1] 0.1666667 0.6666667 0.1666667
> dhyper(0:2,2,2,2) *6
[1] 1 4 1
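(The same 1:4:1 ratio can be cross-checked outside R with a hand-rolled hypergeometric pmf; this pure-Python sketch simply mirrors the parameterization of R's dhyper, i.e. x successes among k draws from m white and n black balls:)

```python
from math import comb

def dhyper(x, m, n, k):
    # P(X = x) when drawing k balls without replacement from an urn
    # with m white and n black balls (same parameterization as R's dhyper).
    return comb(m, x) * comb(n, k - x) / comb(m + n, k)

probs = [dhyper(x, 2, 2, 2) for x in range(3)]
print(probs)                    # roughly [0.167, 0.667, 0.167]
print([6 * p for p in probs])   # the 1:4:1 pattern
```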
so that is exactly what you would expect. However,
> dhyper(10782,209410, 276167, 25000)
[1] 0.005230889
so you wouldn't expect 10782 to recur in every single draw. It is close to the mean of the hypergeometric distribution, but
> sim <- rhyper(1e6,209410, 276167, 25000)
> mean(sim)
[1] 10781.53
> sd(sim)
[1] 76.14209
(and incidentally, rhyper is plenty fast enough that you don't really need r2dtable for the 2x2 case)
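(To make that last remark concrete: for a 2x2 table, the margins plus a single hypergeometric draw for the [1,1] cell determine the whole table. A pure-Python sketch follows; the CDF-inversion sampler is only meant for small margins and is hopeless at Gustavo's totals, where one would use a real generator such as R's rhyper or numpy.random.Generator.hypergeometric. The closed-form mean and sd are the standard hypergeometric moments:)

```python
import random
from math import comb, sqrt

def rhyper_small(m, n, k, rng=random):
    # One hypergeometric draw by CDF inversion; fine for small m + n,
    # far too slow for margins in the hundreds of thousands.
    u, acc, total = rng.random(), 0.0, comb(m + n, k)
    for x in range(max(0, k - n), min(k, m) + 1):
        acc += comb(m, x) * comb(n, k - x) / total
        if u <= acc:
            return x
    return min(k, m)

def table_2x2(a11, rr, cc):
    # The single free cell a11 plus the margins determine the whole table.
    return [[a11, rr[0] - a11],
            [cc[0] - a11, rr[1] - cc[0] + a11]]

# Analytic mean/sd of A11 for Gustavo's margins:
m, n, k = 209410, 276167, 25000
N = m + n
mean = k * m / N
sd = sqrt(k * (m / N) * (n / N) * (N - k) / (N - 1))
print(mean, sd)   # close to the simulated 10781.53 and 76.14
```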
-pd
>
> Cheers, Jari O.
> ________________________________________
> From: R-devel <r-devel-bounces at r-project.org> on behalf of Martin Maechler <maechler at stat.math.ethz.ch>
> Sent: 25 August 2017 11:30
> To: Gustavo Fernandez Bayon
> Cc: r-devel at r-project.org
> Subject: Re: [Rd] Are r2dtable and C_r2dtable behaving correctly?
>
>>>>>> Gustavo Fernandez Bayon <gbayon at gmail.com>
>>>>>> on Thu, 24 Aug 2017 16:42:36 +0200 writes:
>
>> Hello,
>> While doing some enrichment tests using chisq.test() with simulated
>> p-values, I noticed some strange behaviour. The computed p-value was
>> extremely small, so I decided to dig a little deeper and debug
>> chisq.test(). I noticed then that the simulated statistics returned by the
>> following call
>
>> tmp <- .Call(C_chisq_sim, sr, sc, B, E)
>
>> were all the same, very small numbers. This, at first, seemed strange to
>> me. So I decided to do some simulations myself, and started playing around
>> with the r2dtable() function. Problem is, using my row and column
>> marginals, r2dtable() always returns the same matrix. Let's provide a
>> minimal example:
>
>> rr <- c(209410, 276167)
>> cc <- c(25000, 460577)
>> ms <- r2dtable(3, rr, cc)
>
>> I have tested this code in two machines and it always returned the same
>> list of length three containing the same matrix three times. The repeated
>> matrix is the following:
>
>> [[1]]
>> [,1] [,2]
>> [1,] 10782 198628
>> [2,] 14218 261949
>
>> [[2]]
>> [,1] [,2]
>> [1,] 10782 198628
>> [2,] 14218 261949
>
>> [[3]]
>> [,1] [,2]
>> [1,] 10782 198628
>> [2,] 14218 261949
>
> Yes. You can also do
>
> unique(r2dtable(100, rr, cc))
>
> and see that the result is constant.
>
> I'm pretty sure this is still due to some integer overflow,
>
> in spite of the fact that I had spent quite some time fixing
> such a problem in Dec 2003; see the 14-year-old bug PR#5701
> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=5701#c2
>
> It has to be said that this is based on an algorithm published
> in 1981, specifically - from help(r2dtable) -
>
> Patefield, W. M. (1981) Algorithm AS159. An efficient method of
> generating r x c tables with given row and column totals.
> _Applied Statistics_ *30*, 91-97.
>
> For those with JSTOR access (typically via your University),
> available at http://www.jstor.org/stable/2346669
>
> When I start reading it, the algorithm does indeed seem to start
> from the expected value of a cell entry and then "explore from there"...
> and I do wonder if there is not a flaw somewhere in the
> algorithm:
>
> I've now found that a bit more than a year ago, 'paljenczy' found on SO
> https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated
> that the generated tables are indeed too concentrated around the mean.
> Basically his example:
>
>> set.seed(1); system.time(tabs <- r2dtable(1e6, c(100, 100), c(100, 100))); A11 <- vapply(tabs, function(x) x[1, 1], numeric(1))
> user system elapsed
> 0.218 0.025 0.244
>> table(A11)
>
> 34 35 36 37 38 39 40 41 42 43
> 2 17 40 129 334 883 2026 4522 8766 15786
> 44 45 46 47 48 49 50 51 52 53
> 26850 42142 59535 78851 96217 107686 112438 108237 95761 78737
> 54 55 56 57 58 59 60 61 62 63
> 59732 41474 26939 16006 8827 4633 2050 865 340 116
> 64 65 66 67
> 38 13 7 1
>>
>
> For a 2x2 table, there's really only one degree of freedom,
> hence the above characterizes the full distribution for that
> case.
>
> I would have expected to see all possible values in 0:100
> instead of such a "normal like" distribution with carrier only
> in [34, 67].
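(For reference: the exact hypergeometric pmf for these margins puts essentially all of its mass within a few standard deviations of 50, so a carrier of [34, 67] in 10^6 draws is what the distribution itself predicts. A pure-Python check, with a hand-rolled pmf in R's dhyper parameterization:)

```python
from math import comb, sqrt

def dhyper(x, m, n, k):
    # P(X = x): x white among k draws from m white and n black (R's dhyper).
    return comb(m, x) * comb(n, k - x) / comb(m + n, k)

pmf = [dhyper(x, 100, 100, 100) for x in range(101)]
sd = sqrt(100 * 0.5 * 0.5 * 100 / 199)  # about 3.54
mass_34_67 = sum(pmf[34:68])
print(sd, mass_34_67)   # nearly all probability sits in [34, 67]
print(pmf[0])           # 1 / choose(200, 100): astronomically small
```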
>
> There are newer publications and maybe better algorithms.
> So maybe the algorithm is "flawed by design" for a really large
> total number of observations, rather than wrong.
> Seems interesting ...
>
> Martin Maechler
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com