[R] Problem with Poisson - Chi Square Goodness of Fit Test - New Mail
Madhavi Bhave
madhavi_bhave at yahoo.com
Fri Aug 29 12:02:42 CEST 2008
Dear R-help,
Chi Square Test for Goodness of Fit
I have a discrete data set, given below (as R code):
No_of_Frauds<-c(1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,2,1,2,2,2,1,1,2,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,5,1,2,1,1,1,1,1,1,1,3,2,1,1,1,2,1,1,2,1,1,1,1,1,2,1,3,1,2,1,2,14,2,1,1,38,3,3,2,44,1,4,1,4,1,2,2,1,3)
I am trying to fit a Poisson distribution to these data using R.
My R script is as follows:
________________________________________________________
# R SCRIPT for Fitting Poisson Distribution
No_of_Frauds<-c(1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,2,1,2,2,2,1,1,2,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,5,1,2,1,1,1,1,1,1,1,3,2,1,1,1,2,1,1,2,1,1,1,1,1,2,1,3,1,2,1,2,14,2,1,1,38,3,3,2,44,1,4,1,4,1,2,2,1,3)
N <- length(No_of_Frauds)
Average <- mean(No_of_Frauds)
Lambda <- Average
i <- c(0:(N-1))
pmf <- dpois(i, Lambda, log = FALSE)
# ----------------------------------------------------------------------------
# Ho: The data follow Poisson Distribution Vs H1: Not Ho
# observed frequencies (Oi)
variable.cnts <- table(No_of_Frauds)
# Poisson probabilities of the observed values only
# (the variable defined above is Lambda, not lambda)
variable.cnts.prs <- dpois(as.numeric(names(variable.cnts)), Lambda)
# extra class collecting all values that were not observed (count 0)
variable.cnts <- c(variable.cnts, 0)
variable.cnts.prs <- c(variable.cnts.prs, 1 - sum(variable.cnts.prs))
tst <- chisq.test(variable.cnts, p = variable.cnts.prs)
chi_squared <- as.numeric(tst$statistic)
p_value <- as.numeric(tst$p.value)
df <- tst$parameter
# critical values at the 1%, 5% and 10% levels of significance
cv1 <- qchisq(p = 0.01, df = df, lower.tail = FALSE)
cv2 <- qchisq(p = 0.05, df = df, lower.tail = FALSE)
cv3 <- qchisq(p = 0.10, df = df, lower.tail = FALSE)
#-----------------------------------------------------------------------------
# Expected value
# variable.cnts.prs * sum(variable.cnts)
# if chi_squared > cv, reject Ho at the alpha level of significance
#-----------------------------------------------------------------------------
if (chi_squared > cv1) {
  Conclusion1 <- 'Sample does not come from the postulated probability distribution at 1% los'
} else {
  Conclusion1 <- 'Sample comes from the postulated probability distribution at 1% los'
}
if (chi_squared > cv2) {
  Conclusion2 <- 'Sample does not come from the postulated probability distribution at 5% los'
} else {
  Conclusion2 <- 'Sample comes from the postulated probability distribution at 5% los'
}
if (chi_squared > cv3) {
  Conclusion3 <- 'Sample does not come from the postulated probability distribution at 10% los'
} else {
  Conclusion3 <- 'Sample comes from the postulated probability distribution at 10% los'
}
#-----------------------------------------------------------------------------
# Printing RESULTS
print(chi_squared)
print(p_value)
print(df)
print(cv1)
print(cv2)
print(cv3)
print(Conclusion1)
print(Conclusion2)
print(Conclusion3)
##### End of R Script ########
________________________________________________________
Problem faced:
When I run this script in the R console, I get a Chi-square statistic as high as 6.95753e+37.
When I did the same calculation in Excel, I got a Chi-square statistic of 138.34.
Although it is clear that the sample data do not follow a Poisson distribution, and I will have to look for another discrete distribution, my problem is the very high value of the Chi-square test statistic. When I analyzed this further, I understood the problem.
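To see where such a large statistic can come from, here is a minimal sketch that prints each class's contribution (Oi - Ei)^2 / Ei without any pooling. It assumes the observed frequencies are those shown in the table of this post, with the four largest values being 5, 14, 38 and 44 as in the posted vector; the vector itself may tabulate slightly differently, so the exact figures are illustrative only. The names `per_class` and `contribution` are made up for this example.

```r
# Distinct observed values and their frequencies (taken from the table in
# this post; an assumption, not a re-tabulation of the posted vector)
x  <- c(1, 2, 3, 4, 5, 14, 38, 44)
Oi <- c(72, 17, 5, 3, 1, 1, 1, 1)

N      <- sum(Oi)            # sample size
Lambda <- sum(x * Oi) / N    # sample mean, used as the Poisson rate

# Expected count and chi-square contribution of every observed value
Ei           <- N * dpois(x, Lambda)
contribution <- (Oi - Ei)^2 / Ei

per_class <- data.frame(x, Oi, Ei, contribution)
print(per_class)
```

The rows for 14, 38 and 44 have expected counts that are practically zero, so dividing by Ei makes their contributions enormous; the classes 1 to 5 contribute only moderate amounts.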
(A) By convention, if the expected frequency of a class is less than 5, we pool such classes into a new class whose expected frequency is at least 5, and pool the observed frequencies accordingly.
X        Oi     Ei    ((Oi - Ei)^2)/Ei
0         0     10              9.96
1        72     23            103.79
2        17     27              3.54
3         5     21             11.85
4         3     12              6.71
5         4      9              2.51
Total   101    101            138.34
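The table above can be reproduced in R by passing the pooled counts straight to chisq.test(). This is only a sketch under two assumptions: the last row is read as the pooled class "5 or more", and the rate is the mean implied by the tabulated counts with the four pooled observations being 5, 14, 38 and 44. The names `obs`, `lambda_hat` and `prob` are illustrative.

```r
# Observed frequencies for X = 0, 1, 2, 3, 4 and the pooled class X >= 5
obs <- c(0, 72, 17, 5, 3, 4)
n   <- sum(obs)

# Sample mean implied by the counts (assumes the pooled values are 5, 14, 38, 44)
lambda_hat <- (1 * 72 + 2 * 17 + 3 * 5 + 4 * 3 + 5 + 14 + 38 + 44) / n

# Poisson probabilities for 0..4, plus the upper tail P(X >= 5)
prob <- c(dpois(0:4, lambda_hat), ppois(4, lambda_hat, lower.tail = FALSE))

tst <- chisq.test(obs, p = prob)
print(round(tst$expected, 2))   # expected frequencies Ei
print(tst$statistic)            # about 138.34, as in the table
```

Using ppois(..., lower.tail = FALSE) for the last class makes the probabilities sum to 1, which chisq.test() requires when `p` is supplied.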
When I apply this logic in Excel, I get a reasonable result (i.e. 138.34). However, if I don't apply this logic in Excel either, the Chi-square test statistic is as high as 4.70043E+37.
My question is: how do I modify my R script so that the logic mentioned in (A) is applied, i.e. classes are pooled (expected and observed frequencies alike) until the expected frequency of every class is at least 5, thereby giving a reasonable value of the Chi-square test statistic?
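One possible way to automate rule (A) is sketched below; it is not the only way to pool classes. The helper `poisson_gof_pooled` is a hypothetical function written for this example, not part of R. It assumes that pooling inwards from the two tails is enough, which holds for a Poisson fit because the expected counts rise to a single peak and then fall. It also subtracts one extra degree of freedom because the rate is estimated from the data, something chisq.test() does not do on its own.

```r
# Hypothetical helper: Poisson goodness-of-fit test with tail pooling
poisson_gof_pooled <- function(x, min_expected = 5) {
  n          <- length(x)
  lambda_hat <- mean(x)
  k          <- 0:max(x)

  # Observed and expected counts for 0, 1, ..., max(x); the last class
  # is treated as ">= max(x)" so that the probabilities sum to 1
  obs  <- as.numeric(table(factor(x, levels = k)))
  prob <- dpois(k, lambda_hat)
  prob[length(k)] <- ppois(max(x) - 1, lambda_hat, lower.tail = FALSE)
  expd <- n * prob

  # Pool the upper tail into its neighbour until its expected count is large enough
  while (length(expd) > 1 && expd[length(expd)] < min_expected) {
    m <- length(expd)
    expd[m - 1] <- expd[m - 1] + expd[m]
    obs[m - 1]  <- obs[m - 1] + obs[m]
    expd <- expd[-m]
    obs  <- obs[-m]
  }
  # Same for the lower tail
  while (length(expd) > 1 && expd[1] < min_expected) {
    expd[2] <- expd[1] + expd[2]
    obs[2]  <- obs[1] + obs[2]
    expd <- expd[-1]
    obs  <- obs[-1]
  }

  stat <- sum((obs - expd)^2 / expd)
  df   <- length(expd) - 1 - 1   # classes - 1 - one estimated parameter
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE),
       observed = obs, expected = expd)
}

# Usage with data rebuilt from the frequencies tabulated in this post
# (an assumption; substitute the real No_of_Frauds vector)
x_demo <- rep(c(1, 2, 3, 4, 5, 14, 38, 44), times = c(72, 17, 5, 3, 1, 1, 1, 1))
res <- poisson_gof_pooled(x_demo)
print(res)
```

With these counts the pooling ends at the classes 0, 1, 2, 3, 4 and "5 or more", the same grouping as in the Excel table.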
I am also attaching the xls file for ready
reference.
I sincerely apologize for taking the liberty of writing such a long mail. Since I am very new to the “R language”, could someone please help me out?
Thanking you in advance for your kind co-operation.
Ashok (Mumbai, India)