# [R] Problem with Poisson - Chi Square Goodness of Fit Test - New Mail

Fri Aug 29 12:02:42 CEST 2008

Dear R-help,

Chi Square Test for Goodness of Fit

I have a discrete data set, given below (R script):

No_of_Frauds<-c(1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,2,1,2,2,2,1,1,2,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,5,1,2,1,1,1,1,1,1,1,3,2,1,1,1,2,1,1,2,1,1,1,1,1,2,1,3,1,2,1,2,14,2,1,1,38,3,3,2,44,1,4,1,4,1,2,2,1,3)

I am trying to fit a Poisson distribution to this data using R.

My R script is as follows:

________________________________________________________

# R SCRIPT for Fitting Poisson Distribution

No_of_Frauds<-c(1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,2,1,2,2,2,1,1,2,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,5,1,2,1,1,1,1,1,1,1,3,2,1,1,1,2,1,1,2,1,1,1,1,1,2,1,3,1,2,1,2,14,2,1,1,38,3,3,2,44,1,4,1,4,1,2,2,1,3)

N <- length(No_of_Frauds)

Average <- mean(No_of_Frauds)

Lambda <- Average

i <- 0:(N - 1)

pmf <- dpois(i, Lambda, log = FALSE)

#-----------------------------------------------------------------------------

# Ho: The data follow a Poisson distribution vs. H1: Not Ho

# observed frequencies (Oi)

variable.cnts <- table(No_of_Frauds)

# Poisson probabilities of the observed values (the rate is stored in Lambda; R is case-sensitive)
variable.cnts.prs <- dpois(as.numeric(names(variable.cnts)), Lambda)

# extra class with zero observations for all values that were not observed
variable.cnts <- c(variable.cnts, 0)

variable.cnts.prs <- c(variable.cnts.prs, 1 - sum(variable.cnts.prs))

tst <- chisq.test(variable.cnts, p = variable.cnts.prs)

chi_squared <- as.numeric(tst$statistic)

p_value <- as.numeric(tst$p.value)

df <- tst$parameter

# critical values at the 1%, 5% and 10% levels of significance
cv1 <- qchisq(p = 0.01, df = df, lower.tail = FALSE, log.p = FALSE)

cv2 <- qchisq(p = 0.05, df = df, lower.tail = FALSE, log.p = FALSE)

cv3 <- qchisq(p = 0.1, df = df, lower.tail = FALSE, log.p = FALSE)

#-----------------------------------------------------------------------------

# Expected value

# variable.cnts.prs * sum(variable.cnts)

# if chi_squared > cv, reject Ho at significance level alpha

#-----------------------------------------------------------------------------

if (chi_squared > cv1) {
  Conclusion1 <- 'Sample does not come from the postulated probability distribution at 1% los'
} else {
  Conclusion1 <- 'Sample comes from the postulated probability distribution at 1% los'
}

if (chi_squared > cv2) {
  Conclusion2 <- 'Sample does not come from the postulated probability distribution at 5% los'
} else {
  Conclusion2 <- 'Sample comes from the postulated probability distribution at 5% los'
}

if (chi_squared > cv3) {
  Conclusion3 <- 'Sample does not come from the postulated probability distribution at 10% los'
} else {
  Conclusion3 <- 'Sample comes from the postulated probability distribution at 10% los'
}

#-----------------------------------------------------------------------------

# Printing RESULTS

print(chi_squared)

print(p_value)

print(df)

print(cv1)

print(cv2)

print(cv3)

print(Conclusion1)

print(Conclusion2)

print(Conclusion3)

##### End of R Script ########

________________________________________________________

Problem Faced :

When I run this script in the R console, I get a chi-square statistic as high as 6.95753e+37.

When I did the same calculation in Excel, I got a chi-square statistic of 138.34.

Although it is clear that the sample data do not follow a Poisson distribution, and I will have to look for another discrete distribution, my problem is the very high value of the chi-square test statistic. When I analyzed this further, I understood the problem.

(A) By convention, if the expected frequency of a class is less than 5, we pool such classes into a new class whose expected frequency is greater than 5, and pool the observed frequencies accordingly:

| X | Oi | Ei | ((Oi - Ei)^2)/Ei |
|---|---|---|---|
| 0 | 0 | 10 | 9.96 |
| 1 | 72 | 23 | 103.79 |
| 2 | 17 | 27 | 3.54 |
| 3 | 5 | 21 | 11.85 |
| 4 | 3 | 12 | 6.71 |
| 5 | 4 | 9 | 2.51 |
| Total | 101 | 101 | 138.34 |
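The pooled calculation in the table can be sketched in R as follows. This is only a sketch under assumptions: the data are rebuilt from the observed counts in the table (101 observations) rather than taken from the `No_of_Frauds` vector in the script, which may not match the table exactly, and the last class is treated as "5 or more" so that it absorbs the whole upper tail.

```r
# Data rebuilt from the observed counts in the table (an assumption):
# 72 ones, 17 twos, 5 threes, 3 fours and the four large values 5, 14, 38, 44.
No_of_Frauds <- c(rep(1, 72), rep(2, 17), rep(3, 5), rep(4, 3), 5, 14, 38, 44)

N      <- length(No_of_Frauds)
Lambda <- mean(No_of_Frauds)

# Classes 0, 1, 2, 3, 4 and "5 or more".
k_max    <- 5
pooled   <- pmin(No_of_Frauds, k_max)
observed <- as.vector(table(factor(pooled, levels = 0:k_max)))

# Poisson probabilities for 0..4, plus P(X >= 5) for the pooled tail class.
probs    <- c(dpois(0:(k_max - 1), Lambda),
              ppois(k_max - 1, Lambda, lower.tail = FALSE))
expected <- N * probs

chi_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: classes - 1, minus 1 more because Lambda is estimated.
df      <- length(observed) - 2
p_value <- pchisq(chi_squared, df = df, lower.tail = FALSE)

print(round(expected, 2))
print(chi_squared)
print(p_value)
```

The expected counts printed here are unrounded, whereas the Ei column in the table appears to be rounded to whole numbers for display.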

When I apply this logic in Excel, I get a reasonable result (138.34). However, in Excel too, if I do not apply this logic, the chi-square test statistic is as high as 4.70043E+37.
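A short sketch of why the unpooled statistic becomes so large: the few large values (14, 38, 44) each form their own class, and their Poisson expected frequencies are almost zero, so the term (Oi - Ei)^2/Ei for those classes dominates the sum. The data below are rebuilt from the table counts, which is an assumption and may differ slightly from the vector used in the script.

```r
# Data rebuilt from the observed counts in the table (an assumption; it may
# not match the No_of_Frauds vector in the script exactly).
No_of_Frauds <- c(rep(1, 72), rep(2, 17), rep(3, 5), rep(4, 3), 5, 14, 38, 44)

N      <- length(No_of_Frauds)
Lambda <- mean(No_of_Frauds)

# The three largest values each occur once, so Oi = 1 for each of these classes.
x            <- c(14, 38, 44)
observed     <- c(1, 1, 1)
expected     <- N * dpois(x, Lambda)
contribution <- (observed - expected)^2 / expected

# Expected counts are tiny, so each contribution is roughly 1 / expected.
print(data.frame(x, observed, expected, contribution))
```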

My question is: how do I modify my R script so that the logic mentioned in (A) is applied, i.e. classes are pooled (expected frequencies, and observed frequencies accordingly) until the expected frequency of every class is greater than 5, thereby giving a reasonable value of the chi-square test statistic?
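One possible shape for such a modification is sketched below. The helper `pool_poisson_classes` is hypothetical (it is not part of R or any package); it only pools the upper tail into a single "k or more" class and assumes the lower classes already have large enough expected frequencies. The data are again rebuilt from the table counts, which is an assumption.

```r
# Hypothetical helper (not part of base R): pools the upper tail into one
# "k or more" class, using the largest k for which every class still has an
# expected frequency of at least min_expected.
pool_poisson_classes <- function(x, lambda, min_expected = 5) {
  n <- length(x)
  k <- max(x)
  repeat {
    probs <- c(dpois(0:(k - 1), lambda),
               ppois(k - 1, lambda, lower.tail = FALSE))
    if (all(n * probs >= min_expected) || k == 1) break
    k <- k - 1
  }
  observed <- as.vector(table(factor(pmin(x, k), levels = 0:k)))
  list(observed = observed, probs = probs, k = k)
}

# Data rebuilt from the observed counts in the table (an assumption).
No_of_Frauds <- c(rep(1, 72), rep(2, 17), rep(3, 5), rep(4, 3), 5, 14, 38, 44)
Lambda <- mean(No_of_Frauds)

cls <- pool_poisson_classes(No_of_Frauds, Lambda)
tst <- chisq.test(cls$observed, p = cls$probs)

chi_squared <- as.numeric(tst$statistic)

# chisq.test() uses (classes - 1) degrees of freedom; one more is subtracted
# here because Lambda was estimated from the sample.
df      <- length(cls$observed) - 2
p_value <- pchisq(chi_squared, df = df, lower.tail = FALSE)

print(cls$k)
print(chi_squared)
print(p_value)
```

The rest of the script (critical values and conclusions) could then use `chi_squared` and `df` from this sketch unchanged.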

I am also attaching the xls file for ready reference.

I sincerely apologize for taking the liberty of writing such a long mail. Since I am very new to the R language, could someone please help me out?

Ashok (Mumbai, India)