[R] Chi-squared approximation may be incorrect in: chisq.tes
(Ted Harding)
Ted.Harding at nessie.mcc.ac.uk
Sat Dec 2 18:42:20 CET 2006
On 02-Dec-06 Ethan Johnsons wrote:
> I am getting "Chi-squared approximation may be incorrect in:
> chisq.test(x)" with the data bleow.
>
> Frequency distribution of number of male offspring in families of size
> 5.
> Number of Male Offspring N
> 0 518
> 1 2245
> 2 4621
> 3 4753
> 4 2476
> 5 549
> Total 15162
>
>
>> x=matrix(c(0,1,2,3,4,5,518,2245,4621,4753,2476,549), ncol=2)
>> x
> [,1] [,2]
> [1,] 0 518
> [2,] 1 2245
> [3,] 2 4621
> [4,] 3 4753
> [5,] 4 2476
> [6,] 5 549
>> chisq.test(x)
>
> Pearson's Chi-squared test
>
> data: x
> X-squared = 40.4672, df = 5, p-value = 1.202e-07
>
> Warning message:
> Chi-squared approximation may be incorrect in: chisq.test(x)
>
> Does the warning mean that chisq.test(x) can't be used for this data.
>
> If chisq.test(x) can't be used for this data, which test is to be used?
You are feeding a matrix with 6 rows and *2 columns* to chisq.test()
See "?chisq.test":
If 'x' is a matrix with at least two rows and columns, it is taken
as a two-dimensional contingency table, and hence its entries
should be nonnegative integers.
In other words, the first column -- giving the numbers (0-5)
of male offspring -- is also being used as count data, so what
you're getting is a test of association between the two sets
of "counts" {0,1,2,3,4,5} and {518,2245,4621,4753,2476,549}!
So, if (which seems likely) you want to test the counts in
the second column against (say) a binomial distribution,
you could use the following code (assuming P[boy]=0.5)
chisq.test(x[,2],p=dbinom(x[,1],5,0.5))
## Chi-squared test for given probabilities
## data: x[, 2]
## X-squared = 30.3181, df = 5, p-value = 1.277e-05
so it seems a binomial with P=0.5 won't do. Hence next try
the best estimate of P assuming binomial:
P<-sum((0:5)*x[,2])/(5*sum(x[,2]))
P
## [1] 0.5064635
So does this improve things?:
chisq.test(x[,2],p=dbinom(x[,1],5,P))
## Chi-squared test for given probabilities
## data: x[, 2]
## X-squared = 17.744, df = 5, p-value = 0.003285
well, quite a bit -- but of course chisq.test() uses the wrong
degrees of freedom (should be 4, since we've estimated P this
time), so actually it's worse than that:
## 1-pchisq(17.744,4)
## [1] 0.001384664
though at 1.4e-3 it's still better than 1.3e-5!
Nevertheless, now that we've got the best-fitting binomial model,
the hypothesis that the numbers of male children in families of
size 5 is binomially distributed is clearly implausible.
Sufficient conditions for binomial distribution are
a) The genders of different births in the same family are
independent of each other
b) The probability of "boy" is the same for all families.
Hence either (a) or (b) [or both] fails for these data.
I don't know of any way to force chisq.test() itself to take
account of degrees of freedom used up in estimation. chisq.test()
is a test for contingency tables (including a 1-row table tested
against a fixed given probability distribution).
Anyway, an indication of how the data fail to follow the best fitting
binomial distribution can be obtained by looking at
X2T<-chisq.test(x[,2],p=dbinom(x[,1],5,P))
cbind((0:5),X2T$expected,X2T$observed)
## [,1] [,2] [,3]
## [1,] 0 443.9691 518
## [2,] 1 2277.9893 2245
## [3,] 2 4675.3120 4621
## [4,] 3 4797.7711 4753
## [5,] 4 2461.7188 2476
## [6,] 5 505.2396 549
showing (at least) that there seem to be excess counts in
the "0" and "5" classes. This can be further clarified with
n.E<-X2T$expected; n.O<-X2T$observed; Dev<-(n.O-n.E)/sqrt(n.E)
cbind((0:5),X2T$expected,X2T$observed,Dev)
## Dev
## [1,] 0 443.9691 518 3.5134726
## [2,] 1 2277.9893 2245 -0.6911901
## [3,] 2 4675.3120 4621 -0.7943114
## [4,] 3 4797.7711 4753 -0.6463652
## [5,] 4 2461.7188 2476 0.2878353
## [6,] 5 505.2396 549 1.9468512
sum(Dev^2)
## [1] 17.74403
[agreeing with the 17.744 from chisq.test()]
Hence there seems to be a clear excess of all-girl families,
and some excess of all-boy families, but nothing obviously
remarkable about the others (which does not of course exclude
that there is a dependency between the genders of successive
births in these too).
Quite how you proceed further depends on how you wish to model
this dependency. See VGAM package for betabin.ab(), which fits
a beta-binomial (binomial where the probability P varies from
case [family] to case according to a beta-distribution), for
instance.
Hoping this helps,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 02-Dec-06 Time: 17:42:16
------------------------------ XFMail ------------------------------
More information about the R-help
mailing list