[R] Query: chi-square test

Mon Jul 10 16:28:12 CEST 2006

"priti desai" <priti.desai at kalyptorisk.com> writes:

> Hi,
>   I have calculated a chi-square goodness-of-fit test for a sample
> assumed to come from a Poisson distribution.
> Please copy this script into R and run it.
> The R script is as follows:
> 
> ########################## start #########################################
> 
> No_of_Frauds<-
> c(4,1,6,9,9,10,2,4,8,2,3,0,1,2,3,1,3,4,5,4,4,4,9,5,4,3,11,8,12,3,10,0,7)
> 
> 
> 
> lambda<- mean(No_of_Frauds)
>  
> 
> # Chi-Squared Goodness of Fit Test
> 
> # Ho: The data follow a specified distribution Vs H1: Not Ho
> 
> # observed frequencies 
> 
> variable.cnts <- table(No_of_Frauds) 
> variable.cnts
> 
> variable.cnts.prs <- dpois(as.numeric(names(variable.cnts)), lambda)
> variable.cnts.prs
> 
> variable.cnts <- c(variable.cnts, 0) 
> variable.cnts
> variable.cnts.prs <- c(variable.cnts.prs, 1-sum(variable.cnts.prs))
> variable.cnts.prs
> 
> tst <- chisq.test(variable.cnts, p=variable.cnts.prs)
> tst
> 
> ######################### end ########################################
> 
> 
> The result of R is as follows
> 
> Warning message:
> Chi-squared approximation may be incorrect in: chisq.test(variable.cnts,
> p = variable.cnts.prs) 
> > tst
> 
>         Chi-squared test for given probabilities
> 
> data:  variable.cnts 
> X-squared = 40.5614, df = 13, p-value = 0.0001122
> 
> 
> But I have also done the calculations in Excel, and I am getting a
> different answer.
> 
> Observed = 2,3,3,5,7,2,1,1,2,3,2,1,1,0
> Expected = 0.251005528,1.224602726,2.987288468,4.85811559,5.925428863,
> 5.781782103,4.701348074,3.276697142,1.998288788,1.083247457,0.528493456,
> 0.234400679,0.095299266,0.035764993
> 
> 
>  Estimated Parameter  =4.878788
> 
> Chi square stat =  0.000113
> 
> 
> My Excel answer tallies with the book I referred to for Excel.
> Please tell me the correct calculation in R,
> and how to interpret the results in R.

As far as I can see, the "Chi square stat" in Excel is essentially the
p-value in R. The slight difference appears to arise from Excel using
the point probability rather than the tail probability in the last cell:

> O <- c(2,3,3,5,7,2,1,1,2,3,2,1,1,0)
> E <-  c(0.251005528,1.224602726,2.987288468,4.85811559,5.925428863,
+ 5.781782103,4.701348074,3.276697142,1.998288788,1.083247457,0.528493456,
+ 0.234400679,0.095299266,0.035764993)
> (O-E)^2/E
 [1] 1.218691e+01 2.573925e+00 5.409021e-05 4.143826e-03 1.948725e-01
 [6] 2.473610e+00 2.914053e+00 1.581883e+00 1.465377e-06 3.391598e+00
[11] 4.097178e+00 2.500600e+00 8.588560e+00 3.576499e-02
> sum((O-E)^2/E)
[1] 40.54315
> pchisq(sum((O-E)^2/E), 13, lower.tail=FALSE)
[1] 0.0001129818
> E
 [1] 0.25100553 1.22460273 2.98728847 4.85811559 5.92542886 5.78178210
 [7] 4.70134807 3.27669714 1.99828879 1.08324746 0.52849346 0.23440068
[13] 0.09529927 0.03576499
> sum(E)
[1] 32.98176
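
To make the last-cell difference concrete, here is a small sketch that
rebuilds the expected counts both ways from the data in the original
post. The names E_point and E_tail are mine, and treating the Excel
column as 33 * dpois(0:13, lambda) is an inference from the numbers
quoted above, not something I have checked in Excel itself.

```r
# Data from the original post
No_of_Frauds <- c(4,1,6,9,9,10,2,4,8,2,3,0,1,2,3,1,3,4,5,4,4,
                  4,9,5,4,3,11,8,12,3,10,0,7)
n      <- length(No_of_Frauds)
lambda <- mean(No_of_Frauds)

# Point-probability version: last cell is P(X = 13)
E_point <- n * dpois(0:13, lambda)

# Tail version: last cell is P(X >= 13), so the cells cover everything
E_tail <- n * c(dpois(0:12, lambda),
                ppois(12, lambda, lower.tail = FALSE))

sum(E_point)   # falls a little short of n
sum(E_tail)    # adds up to n
```

Only the tail version gives expected counts that sum to the sample
size, which is what a goodness-of-fit test needs.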

Please don't assume that something is correct, just because it is
Excel output and some book mindlessly copied it...

The calculations are both wrong, because they ignore the fact that
lambda has been estimated from the data (which costs a degree of
freedom), and also because they deal with very small expected cell
counts, where the chi-square approximation is unreliable. For a better
test, you likely need to simulate the distribution of the chi-square
statistic, or, as I'd be inclined to do, go directly for the pretty
obvious overdispersion:

> X <- No_of_Frauds
> var(X)
[1] 11.17235
> var(X)/mean(X) # expected is ca. 1 in the Poisson distrib.
[1] 2.289984
> r <- replicate(100000,{X <- rpois(33, 4.87879); var(X)/mean(X)})
> sum(r > 2.289984)
[1] 5
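
For the other route mentioned above, simulating the distribution of the
chi-square statistic, a minimal sketch might look like the following.
It assumes the same cells as in the original script (0 to 12 plus a
">= 13" tail cell) and re-estimates lambda in every simulated sample,
so that the estimation step is reflected in the reference distribution.
The helper chisq_stat, the seed, and the number of replicates are my
own choices for illustration.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable

X <- c(4,1,6,9,9,10,2,4,8,2,3,0,1,2,3,1,3,4,5,4,4,
       4,9,5,4,3,11,8,12,3,10,0,7)
n <- length(X)

# Pearson statistic on cells 0..kmax plus one tail cell (> kmax),
# with lambda estimated from the sample passed in (hypothetical helper)
chisq_stat <- function(x, kmax = 12) {
  lam <- mean(x)
  O <- tabulate(pmin(x, kmax + 1) + 1, nbins = kmax + 2)
  p <- c(dpois(0:kmax, lam), ppois(kmax, lam, lower.tail = FALSE))
  E <- length(x) * p
  sum((O - E)^2 / E)
}

obs <- chisq_stat(X)   # same statistic as chisq.test() reported

# Reference distribution: Poisson samples at the fitted lambda,
# each one refitted before computing the statistic
sim <- replicate(10000, chisq_stat(rpois(n, mean(X))))

# Simulated p-value, with the usual +1 correction
p_sim <- (sum(sim >= obs) + 1) / (length(sim) + 1)
p_sim
```

Comparing p_sim with the nominal p-value from pchisq() gives an idea of
how far the chi-square approximation can be trusted with cells this
sparse.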

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907