# [R] family question

Douglas Bates bates at stat.wisc.edu
Wed Aug 30 23:19:06 CEST 2000

ben at zoo.ufl.edu writes:

>   Seems like a very odd setup but it does seem as though the code would do
> what you wanted.  Also seems as though you could do some combinatorics to
> get the answer much more efficiently if you wanted to ... (i.e., write
> down the probabilities of having families that were (1 boy, 0 girl = 50%)
> (2 boy, 1 girl = dbinom(2,3,0.5) = 3/8), (3 boys, 2 girls), (4 boys, 3
> girls), etc., and then divide by the sum.  Since you add children one at a
> time and stop when sum(n1)>sum(n2), you will always have one more boy than
> girl in the family ...) [That says, by the way, that you won't get equal
> numbers of boys and girls.  You will get between 100% and 50% boys, closer
> to 100%].

Actually, no.  If you have a large number of families you will see, as
you describe below, some extremely large families.  I simulated
another 10000 families with a maximum family size of 20000 and got 55
that would have exceeded the maximum size.  The trick with the
population proportions is that each of those families contributes
about 10000 girls, whereas the approximately 5000 families of size 1
contribute a total of only 5000 boys to the population.  The
proportion in the population starts to look like the proportion in the
extremely large families, which have nearly equal numbers of boys and
girls.
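The underlying reason is that each child's sex is an independent fair coin flip, so no stopping rule can tilt the expected totals: with any cap on family size, the expected numbers of boys and girls in the population are equal.  A minimal Python sketch (mine, not from the post; `boy_share` is a hypothetical helper) that replays the experiment with a size cap:

```python
import random

def boy_share(n_families, max_size, seed=42):
    """Fraction of boys in the whole population (illustrative sketch)."""
    rng = random.Random(seed)
    boys = girls = 0
    for _ in range(n_families):
        b = g = 0
        # have children until boys outnumber girls, or the cap is reached
        while b <= g and b + g < max_size:
            if rng.random() < 0.5:
                b += 1
            else:
                g += 1
        boys += b
        girls += g
    return boys / (boys + girls)
```

The share hovers near 0.5 for any cap: the "between 100% and 50% boys" intuition describes the per-family excess (always exactly one boy for a completed family), not the population, which the huge families dominate.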

>   However, after some playing (see slightly augmented code below),
> it also seems clear that there's something Troels and I are both
> missing, or a bug in the rbinom code (??) [or a failure of low-order
> bits to be uncorrelated?]; the distribution of family sizes has some
> odd characteristics (all family sizes are odd, as expected, but
> there are more large families -- and some that reach the maximum,
> which is unexpected).

The large families are legitimate occurrences in the data.  I'm not
sufficiently skilled in counting problems to give a complete
explanation of what is going on, but basically there are two forces at
work in determining the probability of large families.  The
probability of any particular sequence yielding 2*K+1 children
decreases geometrically with K, but the number of legitimate patterns
of that size grows combinatorially.  As Duncan Murdoch said, the
probabilities get messy.  There is only one pattern each for families
of size 1 and 3, but there are two patterns for a family of size 5 and
five for a family of size 7:

```
Prob    Size  Pattern
1/2      1    B
1/8      3    G B B
1/32     5    G B G B B
1/32     5    G G B B B
1/128    7    G B G B G B B
1/128    7    G B G G B B B
1/128    7    G G B B G B B
1/128    7    G G B G B B B
1/128    7    G G G B B B B
```

These results seem to be confirmed by simulation:

```r
> table(famsz)   # simulation of 10000 families
famsz
   1    3    5    7    9   11   13
5018 1223  624  364  265  186  188
...

> 2 * (1/32)
[1] 0.0625
> 5 * (1/128)
[1] 0.0390625
```
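The pattern counts 1, 1, 2, 5 are the Catalan numbers: a family of size 2K+1 corresponds to a ±1 walk (boy = +1, girl = −1) that first goes strictly positive at step 2K+1, and there are C(2K, K)/(K+1) such paths, each occurring with probability 2^-(2K+1).  A short sketch (Python, function names are mine) that reproduces the figures above:

```python
from math import comb

def n_patterns(k):
    # Catalan number C_k: number of stopping patterns of size 2k + 1
    return comb(2 * k, k) // (k + 1)

def prob_size(k):
    # probability that a family ends up with exactly 2k + 1 children
    return n_patterns(k) / 2 ** (2 * k + 1)
```

`prob_size(2)` gives 2/32 = 0.0625 and `prob_size(3)` gives 5/128 = 0.0390625, so the expected counts in 10000 families are 5000, 1250, 625, and about 390 for sizes 1, 3, 5, 7 — close to the simulated 5018, 1223, 624, 364.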

Overall, the probability of a given family size decreases with the
size, but not quickly enough to rule out very large families.
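Stirling's formula makes "not quickly enough" precise: C_K grows like 4^K/(K^1.5 √π), so P(size = 2K+1) falls off only like 1/(2√π K^1.5) — a polynomial tail, not a geometric one.  A sketch comparing the exact probability with that approximation (my helper names):

```python
from math import comb, sqrt, pi

def p_exact(k):
    # P(size = 2k + 1) = C_k / 2^(2k + 1)
    return (comb(2 * k, k) // (k + 1)) / 2 ** (2 * k + 1)

def p_tail_approx(k):
    # Stirling: C_k ~ 4^k / (k^1.5 * sqrt(pi)), hence this tail estimate
    return 1.0 / (2.0 * sqrt(pi) * k ** 1.5)
```

Already at K of a few hundred the two agree to well under one percent, and a K^-1.5 tail means sizes in the thousands are merely uncommon, not astronomically rare.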

This creates an interesting programming problem.  Generally we want to
work on the "whole object" rather than running loops over elements,
which is why I used a vector of 0/1 results that "should be"
sufficiently long and found the minimum family size satisfying the
condition.

```r
> famsz <- integer(10000)
> ind <- 1:20000
> for (i in seq(along = famsz)) famsz[i] <-
+    min(ind[cumsum(ifelse(runif(length(ind)) > 0.5, 1, 0)) > (ind/2)])
There were 50 or more warnings (use warnings() to see the first 50)
```

This has the advantage of only one call to the random number generator
per family and vectorized operations, but half the time it does 20000
times as much work as necessary!  A more extreme version would
generate a 20000 by 10000 matrix filled with 0/1 values and apply a
function to the columns to generate the families, but that gets you
into a space/time tradeoff.  The matrix would have 2 * 10^8 entries
and I don't have that much memory.
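There is a middle ground between the two extremes: draw the flips in geometrically growing chunks, so the common short families cost little while the rare long ones still get block-at-a-time treatment.  A sketch of the idea in Python (my code, not from the post):

```python
import random
from itertools import accumulate

def family_size(rng, max_size=20000, chunk=32):
    """Children until boys outnumber girls, capped at max_size (a sketch)."""
    size = 0
    excess = 0  # boys minus girls so far; never positive between chunks
    while size < max_size:
        n = min(chunk, max_size - size)
        steps = [1 if rng.random() < 0.5 else -1 for _ in range(n)]
        running = list(accumulate(steps, initial=excess))
        for i in range(1, len(running)):
            if running[i] > 0:          # first moment boys exceed girls
                return size + i
        size += n
        excess = running[-1]
        chunk *= 2   # geometric growth keeps total work near the final size
    return max_size  # gave up, like the 20000-column cap above
```

Completed families always have an odd size (one more boy than girl), and about half of all families have size 1, which a quick simulation confirms.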

An alternative is to run the loop as many times as necessary, as you
and Troels were doing.  I would rewrite that as

```r
> sz <- integer(1); boys <- integer(1)  # make sure they are stored as integers
> for (i in seq(along = famsz)) {
+    sz[] <- boys[] <- 0     # overwrite rather than assign new storage
+    repeat {
+       boys[] <- boys + (runif(1) > 0.5)
+       sz[] <- sz + 1
+       if (2 * boys > sz) break
+    }
+    famsz[i] <- sz
+ }
```

The overwriting is done to avoid garbage collection, although I
imagine that will not be the limiting factor here.  With this code the
large families take a very long time; the loop is not infinite, but
there is no useful bound on how long it may run.  I have had it
running on a machine for an hour and it has not finished yet.
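The long runs are not just bad luck: with P(size = 2K+1) falling off like K^-1.5, the expected family size Σ (2K+1)·P(size = 2K+1) diverges, so the per-family loop has infinite expected running time.  A sketch (my hypothetical helper) computing the truncated mean, which grows without bound roughly like the square root of the cap:

```python
def mean_capped(K):
    """E[min(family size, 2K + 1)] under p(k) = C_k / 2^(2k + 1)."""
    p = 0.5          # p(0): probability the first child is a boy
    mean = 0.0
    tail = 1.0       # probability mass in sizes not yet summed
    for k in range(K):
        mean += (2 * k + 1) * p
        tail -= p
        p *= (2 * k + 1) / (2 * (k + 2))  # exact ratio p(k+1) / p(k)
    return mean + (2 * K + 1) * tail
```

For example `mean_capped(1)` is exactly 2.0, and raising the cap by a factor of 100 raises the truncated mean roughly tenfold — there is no plateau, which is why the straightforward loop can run for hours.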

I'll be interested in solutions that others can propose.
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
