[R] Fitdistr() versus nls()

Sun Sep 24 09:08:19 CEST 2006

On Sat, 23 Sep 2006, Luca Telloli wrote:

> Hello R-Users,
> 	I'm new to R so I apologize in advance for any big mistake I might
> be doing. I'm trying to fit a set of samples with some probabilistic
> curve, and I have an important question to ask; in particular I have
> some data, from which I calculate manually the CDF, and then I import
> them into R and try to fit: I have the x values (my original samples)
> and the y values (P(X<x)).
> 	To attempt the fit I've both fitdistr() and nls(), in the way you
> can read in the piece of code at the end of the email. Because the
> fit with all data doesn't work very well, I decided to take a subset
> of samples randomly chosen (for some random x, the correspondant y is
> chosen).
> 	The first big problem is that fitdistr and nls, in the way I use
> them in the code, they don't get me similar results. I think I'm
> making a mistake but I can't really understand which one.
> 	From this first issue, a second one arises because the plot with nls
> is similar (maybe not a great fit bust still...) to my original CDF
> while the plot of fitdistr is basically a straight line of constant
> value y=1. On the other side, the fitdistr outperforms in the
> Kolmogorov-Smirnov test, which for nls has values of probability
> around 0 (not a good score huh?).
> 	Can u please tell me if there's a major mistake in the code?

Yes: you fit distribution models to x, not to the CDF of x.  So the call 
to fitdistr and those to ks.test should be of somevals.x not somevals.y.

Using nls here is fitting an exponentional to the CDF by least-squares.
This is not very sensible for an exponentional distribution where the MLE 
is known in closed form.

Using a KS test after estimating parameters is invalid unless you make 
adjustments (ks.test does not), and suitable adjustments are AFAIK only 
known for ML estimation.  (The help page does say so and point you to the 
literature.)

>
> Thanks in advance,
> Luca
>
>
> ------ BEGINNING OF CODE
> ----------------------------------------------------------------
> cdf.all=read.table("all_failures.cdf", header=FALSE, col.names=c
> ("ttr", "cdf"), sep=":" )
>
> allvals.x=array(t(cdf.all[1]))
> allvals.y=array(t(cdf.all[2]))

What is the intention here?  I suspect you want cdf.all[[1]] etc.

> library(MASS)
> bestval.exp.nls=bestval.exp.fit=-1
> plot(allvals.x, allvals.y)
>
> for(it in 1:100){
> 	#extract random samples
> 	random=sort(sample(1:length(allvals.x), 15))
> 	somevals.x=allvals.x[c(random)]
> 	somevals.y=allvals.y[c(random)]
> 	#fit with nls and fitdistr
> 	fit.exp = fitdistr(somevals.y, "exponential")
> 	nls.exp <- nls(somevals.y ~ pexp(somevals.x, rate), start=list(rate=.
> 0001), model=TRUE)
> 	#plot what you get out of the two fits
> 	lines(allvals.x, pexp(allvals.x, coef(fit.exp)), col=it)
> 	lines(allvals.x, pexp(allvals.x, coef(nls.exp)), col=it)
> 	#perform kolmogorov-smirnov test on your fit
> 	ks.exp.nls = ks.test(somevals.y, "pexp", coef(nls.exp))
> 	ks.exp.fit = ks.test(somevals.y, "pexp", coef(fit.exp))
>
> 	bestval.exp.fit = max(bestval.exp.fit, ks.exp.fit$p.value)
> 	bestval.exp.nls = max(bestval.exp.nls, ks.exp.nls$p.value)
> }
>
> print(bestval.exp.fit)
> print(bestval.exp.nls)
>
> ----------END OF
> CODE--------------------------------------------------------------------

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595