[Rd] Are r2dtable and C_r2dtable behaving correctly?
Peter Dalgaard
pdalgd at gmail.com
Fri Aug 25 12:04:58 CEST 2017
> On 25 Aug 2017, at 11:23 , Jari Oksanen <jari.oksanen at oulu.fi> wrote:
>
> It is not about "really large total number of observations", but:
>
> set.seed(4711)
> tabs <- r2dtable(1e6, c(2, 2), c(2, 2))
> A11 <- vapply(tabs, function(x) x[1, 1], numeric(1))
> table(A11)
>
> A11
> 0 1 2
> 166483 666853 166664
>
> There are three possible matrices, and these come out in proportions 1:4:1, the one with all cells filled with ones being
> most common.
... and
> dhyper(0:2,2,2,2)
[1] 0.1666667 0.6666667 0.1666667
> dhyper(0:2,2,2,2) *6
[1] 1 4 1
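(The same 1:4:1 ratio can be cross-checked outside R with a hand-rolled hypergeometric pmf; this pure-Python sketch simply mirrors the parameterization of R's dhyper, i.e. x successes among k draws from m white and n black balls:)

```python
from math import comb

def dhyper(x, m, n, k):
    # P(X = x) when drawing k balls without replacement from an urn
    # with m white and n black balls (same parameterization as R's dhyper).
    return comb(m, x) * comb(n, k - x) / comb(m + n, k)

probs = [dhyper(x, 2, 2, 2) for x in range(3)]
print(probs)                    # roughly [0.167, 0.667, 0.167]
print([6 * p for p in probs])   # the 1:4:1 pattern
```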
so that is exactly what you would expect. However,
> dhyper(10782,209410, 276167, 25000)
[1] 0.005230889
so you wouldn't expect 10782 to recur in every single draw. It is close to the mean of the hypergeometric distribution, but
> sim <- rhyper(1e6,209410, 276167, 25000)
> mean(sim)
[1] 10781.53
> sd(sim)
[1] 76.14209
(and incidentally, rhyper is plenty fast enough that you don't really need r2dtable for the 2x2 case)
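(To make that last remark concrete: for a 2x2 table, the margins plus a single hypergeometric draw for the [1,1] cell determine the whole table. A pure-Python sketch follows; the CDF-inversion sampler is only meant for small margins and is hopeless at Gustavo's totals, where one would use a real generator such as R's rhyper or numpy.random.Generator.hypergeometric. The closed-form mean and sd are the standard hypergeometric moments:)

```python
import random
from math import comb, sqrt

def rhyper_small(m, n, k, rng=random):
    # One hypergeometric draw by CDF inversion; fine for small m + n,
    # far too slow for margins in the hundreds of thousands.
    u, acc, total = rng.random(), 0.0, comb(m + n, k)
    for x in range(max(0, k - n), min(k, m) + 1):
        acc += comb(m, x) * comb(n, k - x) / total
        if u <= acc:
            return x
    return min(k, m)

def table_2x2(a11, rr, cc):
    # The single free cell a11 plus the margins determine the whole table.
    return [[a11, rr[0] - a11],
            [cc[0] - a11, rr[1] - cc[0] + a11]]

# Analytic mean/sd of A11 for Gustavo's margins:
m, n, k = 209410, 276167, 25000
N = m + n
mean = k * m / N
sd = sqrt(k * (m / N) * (n / N) * (N - k) / (N - 1))
print(mean, sd)   # close to the simulated 10781.53 and 76.14
```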
-pd
>
> Cheers, Jari O.
> ________________________________________
> From: R-devel <r-devel-bounces at r-project.org> on behalf of Martin Maechler <maechler at stat.math.ethz.ch>
> Sent: 25 August 2017 11:30
> To: Gustavo Fernandez Bayon
> Cc: r-devel at r-project.org
> Subject: Re: [Rd] Are r2dtable and C_r2dtable behaving correctly?
>
>>>>>> Gustavo Fernandez Bayon <gbayon at gmail.com>
>>>>>> on Thu, 24 Aug 2017 16:42:36 +0200 writes:
>
>> Hello,
>> While doing some enrichment tests using chisq.test() with simulated
>> p-values, I noticed some strange behaviour. The computed p-value was
>> extremely small, so I decided to dig a little deeper and debug
>> chisq.test(). I noticed then that the simulated statistics returned by the
>> following call
>
>> tmp <- .Call(C_chisq_sim, sr, sc, B, E)
>
>> were all the same, very small numbers. This, at first, seemed strange to
>> me. So I decided to do some simulations myself, and started playing around
>> with the r2dtable() function. Problem is, using my row and column
>> marginals, r2dtable() always returns the same matrix. Let's provide a
>> minimal example:
>
>> rr <- c(209410, 276167)
>> cc <- c(25000, 460577)
>> ms <- r2dtable(3, rr, cc)
>
>> I have tested this code in two machines and it always returned the same
>> list of length three containing the same matrix three times. The repeated
>> matrix is the following:
>
>> [[1]]
>> [,1] [,2]
>> [1,] 10782 198628
>> [2,] 14218 261949
>
>> [[2]]
>> [,1] [,2]
>> [1,] 10782 198628
>> [2,] 14218 261949
>
>> [[3]]
>> [,1] [,2]
>> [1,] 10782 198628
>> [2,] 14218 261949
>
> Yes. You can also do
>
> unique(r2dtable(100, rr, cc))
>
> and see that the result is constant.
>
> I'm pretty sure this is still due to some integer overflow,
>
> in spite of the fact that I had spent quite some time fixing
> such a problem in Dec 2003; see the 14-year-old bug PR#5701
> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=5701#c2
>
> It has to be said that this is based on an algorithm published
> in 1981, specifically - from help(r2dtable) -
>
> Patefield, W. M. (1981) Algorithm AS159. An efficient method of
> generating r x c tables with given row and column totals.
> _Applied Statistics_ *30*, 91-97.
>
> For those with JSTOR access (typically via your University),
> available at http://www.jstor.org/stable/2346669
>
> When I start reading it, the algorithm does indeed seem to start
> from the expected value of a cell entry and then "explore from there"...
> and I do wonder if there is not a flaw somewhere in the
> algorithm:
>
> I've now found that a bit more than a year ago, 'paljenczy' found on SO
> https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated
> that the generated tables are indeed too concentrated around the mean.
> Basically his example:
>
>> set.seed(1); system.time(tabs <- r2dtable(1e6, c(100, 100), c(100, 100))); A11 <- vapply(tabs, function(x) x[1, 1], numeric(1))
> user system elapsed
> 0.218 0.025 0.244
>> table(A11)
>
> 34 35 36 37 38 39 40 41 42 43
> 2 17 40 129 334 883 2026 4522 8766 15786
> 44 45 46 47 48 49 50 51 52 53
> 26850 42142 59535 78851 96217 107686 112438 108237 95761 78737
> 54 55 56 57 58 59 60 61 62 63
> 59732 41474 26939 16006 8827 4633 2050 865 340 116
> 64 65 66 67
> 38 13 7 1
>>
>
> For a 2x2 table, there's really only one degree of freedom,
> hence the above characterizes the full distribution for that
> case.
>
> I would have expected to see all possible values in 0:100
> instead of such a "normal like" distribution with carrier only
> in [34, 67].
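(For reference: the exact hypergeometric pmf for these margins puts essentially all of its mass within a few standard deviations of 50, so a carrier of [34, 67] in 10^6 draws is what the distribution itself predicts. A pure-Python check, with a hand-rolled pmf in R's dhyper parameterization:)

```python
from math import comb, sqrt

def dhyper(x, m, n, k):
    # P(X = x): x white among k draws from m white and n black (R's dhyper).
    return comb(m, x) * comb(n, k - x) / comb(m + n, k)

pmf = [dhyper(x, 100, 100, 100) for x in range(101)]
sd = sqrt(100 * 0.5 * 0.5 * 100 / 199)  # about 3.54
mass_34_67 = sum(pmf[34:68])
print(sd, mass_34_67)   # nearly all probability sits in [34, 67]
print(pmf[0])           # 1 / choose(200, 100): astronomically small
```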
>
> There are newer publications and maybe better algorithms.
> So maybe the algorithm is "flawed by design" for a really large
> total number of observations, rather than wrong.
> Seems interesting ...
>
> Martin Maechler
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com