[R] R's Data Dredging Philosophy for Distribution Fitting

Ben Bolker bbolker at gmail.com
Thu Jul 15 03:25:02 CEST 2010

emorway <emorway <at> engr.colostate.edu> writes:

> Forum, 
> I'm a grad student in Civil Eng, took some Stats classes that required
> students learn R, and I have since taken to R and use it for as much as I
> can.  Back in my lab/office, many of my fellow grad students still use
> proprietary software at the behest of advisers who are familiar with the
> recommended software (Statistica, @Risk (Excel Add-on), etc).  I have spent
> a lot of time learning R and am confident it can generally out-process,
> out-graph, or more simply stated, out-perform most of these other software
> packages.  However, one area my view has been humbled in is distribution
> fitting.
> I started by reading through
> http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf  After that
> I started digging around on this forum and found posts like this one
> http://r.789695.n4.nabble.com/Fitting-usual-distributions-td800000.html#a800000
> that are close to what I'm after.  That is, given an observation dataset, I
> would like to call a function that cycles through numerous distributions
> (common or not) and then ranks them for me based on Chi-Square,
> Kolmogorov-Smirnov and/or Anderson-Darling, for example.  
> This question was asked back in 2004:
> http://finzi.psych.upenn.edu/R/Rhelp02a/archive/37053.html but the response
> was that this kind of thing wasn't in R nor in proprietary software to the
> best of the responding author's memory.  In 2010, however, this is no longer
> true as @Risk's
> (http://www.palisade.com/risk/?gclid=CKvblPSM7KICFZQz5wodDRI2fg)
> "Distribution Fitting" function does this very thing.  And it is here that
> my R pride has taken a hit.  Based on the first response to the question
> posed here
> is it fair to say that the R community (I realize this is only 1 view) would
> take exception to this kind of "data mining"?  
> Unless I've missed a discussion of a package that does this very thing, it
> seems as though I would need to code something up using fitdistr() and do
> all the ranking myself.  Undoubtedly that would be a good exercise for me,
> but it's hard for me to believe R would be a runner-up to something like
> distribution fitting in @Risk.

   I was one of the respondents in some of the threads you list above,
and I still question why you're doing this in the first place: it's not
*necessarily* a silly thing to do, but that would be my default position.

  It's not hard to hack up something that tries all the distributions
fitdistr() knows about and compares their AIC values (completely ignoring
sensible considerations like whether the distribution is discrete
or not ...)  See below ...

  It's hard to see how you could have a mechanistic (rather
than phenomenological) model in mind if you just want to try
a whole variety of families (not 1 or 2).  Perhaps some flexible
family like Johnson distributions would be appropriate, or
log-spline densities
<http://cran.r-project.org/web/packages/logspline/logspline.pdf> ...

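library(MASS)  ## for fitdistr()

distlist <- c("beta","cauchy","chi-squared","exponential",
              "negative binomial","normal","poisson","t","weibull")

x <- runif(1000)

## wrap fitdistr() in try() so that a failed fit returns a
## "try-error" object instead of stopping the loop
dd <- function(...) {
    try(fitdistr(...),silent=TRUE)
}

s <- lapply(as.list(distlist),dd,x=x)
names(s) <- distlist

## AIC of each successful fit; NA where the fit failed
sapply(s,function(z) if (inherits(z,"try-error")) NA else AIC(z))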
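
If you also want the ranking and a Kolmogorov-Smirnov distance, something along the following lines might do as a starting point.  It is only a sketch: the candidate table `cands`, the helper `rank_fits()` and the simulated gamma data are made up for illustration, and I have restricted it to continuous families that fitdistr() can fit without starting values.  Bear in mind that KS p-values are not valid when the parameters have been estimated from the same data, so only the statistic is kept, as a rough measure of distance.

```r
library(MASS)  # for fitdistr()

set.seed(1)
x <- rgamma(500, shape = 2, rate = 1)  # made-up positive, continuous data

## hypothetical candidate table: fitdistr() name -> matching CDF
cands <- list(exponential = pexp, gamma = pgamma, lognormal = plnorm,
              normal = pnorm, weibull = pweibull)

rank_fits <- function(x, cands) {
  res <- lapply(names(cands), function(nm) {
    fit <- try(suppressWarnings(fitdistr(x, nm)), silent = TRUE)
    if (inherits(fit, "try-error")) return(NULL)  # drop failed fits
    ## fitted CDF: plug the estimated parameters into the p-function
    cdf <- function(q) do.call(cands[[nm]], c(list(q), as.list(fit$estimate)))
    ks <- suppressWarnings(ks.test(x, cdf))
    data.frame(dist = nm, AIC = AIC(fit), KS = unname(ks$statistic))
  })
  out <- do.call(rbind, res)
  out[order(out$AIC), ]  # smallest AIC first
}

ranking <- rank_fits(x, cands)
print(ranking)
```

The rows come back sorted by AIC; sorting on the KS column instead is a one-line change, and the two orderings need not agree.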
