[Rd] application to mentor syrfr package development for Google Summer of Code 2010

Thu Mar 11 06:21:45 CET 2010

Michael,

Thanks for your reply with the information about the Eureqa API -- I
am forwarding it to the r-devel list below.

Dirk,

Will you please agree to referring to the syrfr package as symbolic
genetic algorithm regression of functions but not (yet) general
relations?  It would be best to refer to general relation regression
as a future package, something like 'syrr' and leave the
parametrization of the derivatives to that package.

May I please mentor, in consultation with Michael if necessary, work
on general function regressions while Chillu and John Nash work on the
derivative package necessary for general relation regressions?  Thank
you for your kind consideration.

Best regards,
James Salsman

On Tue, Mar 9, 2010 at 11:06 AM, Michael Schmidt <mds47 at cornell.edu> wrote:
> I think it's a great idea worth trying out. We have always done significance
> tests just on the final frontier of models as a post processing step. Moving
> this into the algorithm could focus the search more on significant higher
> quality solutions. One thing to beware of though is that using parsimony
> pressure just on the number of free parameters tends to focus the search on
> complex equations with no free parameters. So, some care should be taken how
> to implement it.
>
> We do see a measurable improvement using crossover versus just mutation on
> random test problems. Empirically, it doesn't seem necessary for all
> problems but also doesn't seem to ever inhibit the search.
>
> I didn't know that anyone was working on a SR package for R. Very cool! I'm
> happy to consult if you have any questions I can help with.
>
> You may also be interested that we just recently opened up the API for
> interacting with Eureqa servers:
> http://code.google.com/p/eureqa-api/
> If you know of anyone that might be interested in making a wrapper for R,
> please forward.
>
> Michael
>
>
> On Mon, Mar 8, 2010 at 5:45 PM, James Salsman <jsalsman at talknicer.com>
> wrote:
>>
>> Michael,
>>
>> Thanks for your reply:
>>
>> On Mon, Mar 8, 2010 at 12:41 AM, Michael Schmidt <mds47 at cornell.edu>
>> wrote:
>> >
>> > Thanks for contacting me. Eureqa takes into account the total size of an
>> > equation when comparing different candidate models. It attempts to find
>> > the
>> > set of possible equations that are non-dominated in both error and size.
>> > The
>> > final results is a short list consisting of the most accurate equation
>> > for
>> > increasing equation sizes.
>> >
>> > This is closely related to degrees of freedom, but not exactly the same
>>
>> That's very good, but I wonder whether we can perform automatic
>> outlier exclusion that way.  We would need to keep the confidence
>> interval, or at least the information necessary to derive it, accurate
>> in every step of the genetic beam search.  Since the confidence
>> intervals of extrapolation depend so heavily on the number of degrees
>> of freedom of the fit (along with the residual standard error) it's a
>> good idea to use a degree-of-freedom-adjusted F statistic instead of a
>> post-hoc combination of equation size and residual standard error, I
>> would think.  You might want to try it and see how it improves things.
>>  Confidence intervals, by representing the goodness of fit in the
>> original units and domain of the dependent variable, are tremendously
>> useful and sometimes make many kinds of tests which would otherwise be
>> very laborious easy to eyeball.
>>
>> Being able to fit curves to one-to-many relations instead of strict
>> one-to-one functions appeals to those working in the imaging domain,
>> but not to as many traditional non-image statisticians. Regressing
>> functions usually results in well-defined confidence intervals, but
>> regressing general relations with derivatives produces confidence
>> intervals which can also be relations.  Trying to figure out a
>> spiral-shaped confidence interval probably appeals to astronomers more
>> than most people.  So I am proposing that, for R's contemplated
>> 'syrfr' symbolic regression package, we do functions in a general
>> genetic beam search framework, Chillu and John Nash can do derivatives
>> in the new 'adinr' package, and then we can try to put them together,
>> extend the syrfr package with a parameter indicating to fit relations
>> with derivatives instead of functions, to try to replicate your work
>> on Eureqa using d.o.f-adjusted F statistics as a heuristic beam search
>> evaluation function.
>>
>> Have you quantified the extent to which using the crossover rule in
>> the equation tree search is an improvement over mutation alone in
>> symbolic regression?  I am glad that Chillu and Dirk have already
>> supported that; there is no denying its utility.
>>
>> Would you like to co-mentor this project?
>> http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
>> I've already stepped forward, so you could do as much or as little as
>> you like if you wanted to co-mentor and Dirk agreed to that
>> arrangement.
>>
>> Best regards,
>> James Salsman
>>
>>
>> > On Mon, Mar 8, 2010 at 2:49 AM, James Salsman <jsalsman at talknicer.com>
>> > wrote:
>> >>
>> >> I meant that development on both a syrfr R package capable of
>> >> using either F statistics or parametric derivatives should proceed in
>> >> parallel with your work on such a derivatives package. You are right
>> >> that genetic algorithm search (and general best-first search --
>> >> http://en.wikipedia.org/wiki/Best-first_search -- of which genetic
>> >> algorithms are various special cases) can be very effectively
>> >> parallelized, too.
>> >>
>> >> In any case, thank you for pointing out Eureqa --
>> >> http://ccsl.mae.cornell.edu/eureqa -- but I can see no evidence there
>> >> or in the user manual or user forums that Eureqa is considering
>> >> degrees of freedom in its goodness-of-fit estimation.  That is a
>> >> serious problem which will typically result in invalid symbolic
>> >> regression.  I am sending this message also to Michael Schmidt so that
>> >> he might be able to comment on the extent to which Eureqa adjusts for
>> >> degrees of freedom in his fit evaluations.
>> >>
>> >> Best regards,
>> >> James Salsman
>> >>
>> >> On Sun, Mar 7, 2010 at 10:39 PM, Chidambaram Annamalai
>> >> <quantumelixir at gmail.com> wrote:
>> >> >
>> >> >> If I understand your concern, you want to lay the foundation for
>> >> >> derivatives so that you can implement the search strategies
>> >> >> described
>> >> >> in Schmidt and Lipson (2010) --
>> >> >> http://www.springerlink.com/content/l79v2183725413w0/ -- is that
>> >> >> right?
>> >> >
>> >> > Yes. Basically traditional "naive" error estimators or fitness
>> >> > functions
>> >> > fail miserably when used in SR with implicit equations because they
>> >> > immediately close in on "best" fits like f(x) = x - x and other
>> >> > trivial
>> >> > solutions. In such cases no amount of regularization and complexity
>> >> > penalizing methods will help since x - x is fairly simple by most
>> >> > measures
>> >> > of complexity and it does have zero error. So the paper outlines such
>> >> > problems associated with "direct" error estimators and thus they
>> >> > infer
>> >> > the
>> >> > "triviality" of the fit by probing its estimates around nearby points
>> >> > and
>> >> > seeing if it does follow the pattern dictated by the data points --
>> >> > ergo
>> >> > derivatives.
>> >> >
>> >> > Also, somewhat like a side benefit, this method also enables us to
>> >> > perform
>> >> > regression on closed loops and other implicit equations since the
>> >> > fitness
>> >> > functions are based only on derivatives. The specific form of the
>> >> > error
>> >> > is
>> >> > equation 1.2 which is what, I believe, comprises of the internals of
>> >> > the
>> >> > evaluation procedure used in Eureqa.
>> >> >
>> >> > You are correct in pointing out that there is no reason to not work
>> >> > in
>> >> > parallel, since GAs generally have a more or less fixed form
>> >> > (evaluate-reproduce cycle) which is quite easily parallelized. I have
>> >> > used
>> >> > OpenMP in the past, in which it is fairly trivial to parallelize well
>> >> > formed
>> >> > for loops.
>> >> >
>> >> > Chillu
>> >> >
>> >> >> It is not clear to me how well this generalized approach will
>> >> >> work in practice, but there is no reason not to proceed in parallel
>> >> >> to
>> >> >> establish a framework under which you could implement the metrics
>> >> >> proposed by Schmidt and Lipson in the contemplated syrfr package.
>> >> >>
>> >> >> I have expanded the test I proposed with two more questions -- at
>> >> >>
>> >> >>
>> >> >> http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
>> >> >> -- specifically:
>> >> >>
>> >> >> 5. Critique http://sites.google.com/site/gptips4matlab/
>> >> >>
>> >> >> 6. Use anova to compare the goodness-of-fit of a SSfpl nls fit with
>> >> >> a
>> >> >> linear model of your choice. How can your characterize the
>> >> >> degree-of-freedom-adjusted goodness of fit of nonlinear models?
>> >> >>
>> >> >> I believe pairwise anova.nls is the optimal comparison for nonlinear
>> >> >> models, but there are several good choices for approximations,
>> >> >> including the residual standard error, which I believe can be
>> >> >> adjusted
>> >> >> for degrees of freedom, as can the F statistic which TableCurve
>> >> >> uses;
>> >> >> see: http://en.wikipedia.org/wiki/F-test#Regression_problems
>> >> >>
>> >> >> Best regards,
>> >> >> James Salsman
>
>
>
> --
> Michael Schmidt
> Cornell Computational Synthesis Lab
> Cornell University, 239 Upson Hall, Ithaca, NY 14853
> email: mds47 at cornell.edu
>