[R] How to calculate area between ECDF and CDF?
Mike Lawrence
m4lawren at artsmail.uwaterloo.ca
Mon Dec 4 19:09:01 CET 2006
Hi all,
I'm working with data to which I'm fitting three-parameter weibull
distributions (shape, scale & shift). The data are of low sample sizes
(between 10 and 80 observations), so I'm reluctant to check my fits
using chi-square (also, I'd like to avoid bin choice issues). I'd use
the Kolmogorov-Smirnov test, but of course this is invalid when the
distribution parameters are estimated from the data.
So I'm tinkering with an alternative method (excuse my naiveté if this
is a bad idea, I'm a relative statistical novice) that calculates the
area of the difference between the ECDF of the data and the CDF of the
estimated function (somewhat like KS, which looks at the greatest
distance between these). My thought is to compare this observed area to
a distribution of simulated areas derived by monte carlo simulation
(draw N random samples from the estimated function, calculating area,
and repeat 1e5 times). If the observed area is greater than say 95% of
the simulated areas, then I'd reject the fit.
My problem is that I can't figure out how to efficiently calculate the
area between the ECDF and CDF functions. I can of course calculate the
integral of each easily, and if one were consistently larger than the
other simple subtraction of the integrals would yield the area between.
However, when the functions cross, as frequently occurs, the solution
seems much more complex. Any suggestions? Since as noted above I'll be
doing the area calculation 1e5 times or so per test, a computationally
frugal solution would be much appreciated!
Here's some code that I've been toying with:
#set up some true parameters
shape=2
scale=.5
shift=.3
n=10
#generate some observed data
obs=obs=rweibull(10,shape,scale)+shift
#lets say that the following are the estimated parameters from whatever
estimation process I'm using
est.shape=1.9
est.scale=.6
est.shift=.35
#Calculate area between ECDF and CDF of the function defined by the
#estimated parameters
# ???
#The following would work if the ECDF were consistently higher or lower
#than the CDF
#Get the CDF area between 0 and some large number (here, 10 is pretty
#large)
cdf.area=integrate(pweibull,0,10,shape=est.shape,scale=est.scale)
#Get the ECDF area.
#first get rid of the shift in obs
obs=obs-est.shift
#calculate area by multiplying cumulative proportions by distance
#between knots, then summing
#add knot at 10 to match cdf
k=c(knots(ecdf(obs)),10)
ecdf.area=vector("numeric",(n-1))
for(i in 1:n){
ecdf.area[i]=(k[i+1]-k[i])*(sum(obs<=k[i])/n)
}
ecdf.area=sum(ecdf.area)
#again, subtraction of the areas works if the ecdf is consistently lower
#than the cdf
diff=cdf.area-ecdf.area
#or consistently higher than the cdf
diff=ecdf.area-cdf.area
#but how to calculate when the functions cross?
Cheers,
Mike
--
Mike Lawrence
http://artsweb.uwaterloo.ca/~m4lawren
"The road to wisdom? Well, it's plain and simple to express:
Err and err and err again, but less and less and less."
- Piet Hein
More information about the R-help
mailing list