[R] Chi-squared approximation may be incorrect in: chisq.tes

Sat Dec 2 18:42:20 CET 2006

On 02-Dec-06 Ethan Johnsons wrote:
> I am getting "Chi-squared approximation may be incorrect in:
> chisq.test(x)"  with the data bleow.
> 
> Frequency distribution of number of male offspring in families of size
> 5.
> Number of Male Offspring      N
>         0                           518
>         1                          2245
>         2                          4621
>         3                          4753
>         4                          2476
>         5                           549
> Total                             15162
> 
> 
>> x=matrix(c(0,1,2,3,4,5,518,2245,4621,4753,2476,549), ncol=2)
>> x
>      [,1] [,2]
> [1,]    0  518
> [2,]    1 2245
> [3,]    2 4621
> [4,]    3 4753
> [5,]    4 2476
> [6,]    5  549
>> chisq.test(x)
> 
>         Pearson's Chi-squared test
> 
> data:  x
> X-squared = 40.4672, df = 5, p-value = 1.202e-07
> 
> Warning message:
> Chi-squared approximation may be incorrect in: chisq.test(x)
> 
> Does the warning mean that chisq.test(x) can't be used for this data.
> 
> If chisq.test(x) can't be used for this data, which test is to be used?

You are feeding a matrix with 6 rows and *2 columns* to chisq.test()

See "?chisq.test":

  If 'x' is a matrix with at least two rows and columns, it is taken
  as a two-dimensional contingency table, and hence its entries
  should be nonnegative integers.

In other words, the first column -- giving the numbers (0-5)
of male offspring -- is also being used as count data, so what
you're getting is a test of association between the two sets
of "counts" {0,1,2,3,4,5} and {518,2245,4621,4753,2476,549}!

So, if (which seems likely) you want to test the counts in
the second column against (say) a binomial distribution,
you could use the following code (assuming P[boy]=0.5)

  chisq.test(x[,2],p=dbinom(x[,1],5,0.5))
  ##    Chi-squared test for given probabilities
  ##    data:  x[, 2] 
  ##    X-squared = 30.3181, df = 5, p-value = 1.277e-05

so it seems a binomial with P=0.5 won't do. Hence next try
the best estimate of P assuming binomial:

  P<-sum((0:5)*x[,2])/(5*sum(x[,2]))
  P
  ##  [1] 0.5064635

So does this improve things?:

  chisq.test(x[,2],p=dbinom(x[,1],5,P))
  ##    Chi-squared test for given probabilities
  ##    data:  x[, 2] 
  ##    X-squared = 17.744, df = 5, p-value = 0.003285

well, quite a bit -- but of course chisq.test() uses the wrong
degrees of freedom (should be 4, since we've estimated P this
time), so actually it's worse than that:

  ## 1-pchisq(17.744,4)
  ## [1] 0.001384664

though at 1.4e-3 it's still better than 1.3e-5!

Nevertheless, now that we've got the best-fitting binomial model,
the hypothesis that the numbers of male children in families of
size 5 is binomially distributed is clearly implausible.

Sufficient conditions for binomial distribution are
a) The genders of different births in the same family are
   independent of each other
b) The probability of "boy" is the same for all families.
Hence either (a) or (b) [or both] fails for these data.

I don't know of any way to force chisq.test() itself to take
account of degrees of freedom used up in estimation. chisq.test()
is a test for contingency tables (including a 1-row table tested
against a fixed given probability distribution).

Anyway, an indication of how the data fail to follow the best fitting
binomial distribution can be obtained by looking at

  X2T<-chisq.test(x[,2],p=dbinom(x[,1],5,P))
  cbind((0:5),X2T$expected,X2T$observed)
  ##       [,1]      [,2] [,3]
  ##  [1,]    0  443.9691  518
  ##  [2,]    1 2277.9893 2245
  ##  [3,]    2 4675.3120 4621
  ##  [4,]    3 4797.7711 4753
  ##  [5,]    4 2461.7188 2476
  ##  [6,]    5  505.2396  549

showing (at least) that there seem to be excess counts in
the "0" and "5" classes. This can be further clarified with

  n.E<-X2T$expected; n.O<-X2T$observed; Dev<-(n.O-n.E)/sqrt(n.E)
  cbind((0:5),X2T$expected,X2T$observed,Dev)
  ##                               Dev
  ##  [1,] 0  443.9691  518  3.5134726
  ##  [2,] 1 2277.9893 2245 -0.6911901
  ##  [3,] 2 4675.3120 4621 -0.7943114
  ##  [4,] 3 4797.7711 4753 -0.6463652
  ##  [5,] 4 2461.7188 2476  0.2878353
  ##  [6,] 5  505.2396  549  1.9468512

  sum(Dev^2)
  ##  [1] 17.74403
[agreeing with the 17.744 from chisq.test()]

Hence there seems to be a clear excess of all-girl families,
and some excess of all-boy families, but nothing obviously
remarkable about the others (which does not of course exclude
that there is a dependency between the genders of successive
births in these too).

Quite how you proceed further depends on how you wish to model
this dependency. See VGAM package for betabin.ab(), which fits
a beta-binomial (binomial where the probability P varies from
case [family] to case according to a beta-distribution), for
instance.

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 02-Dec-06                                       Time: 17:42:16
------------------------------ XFMail ------------------------------