[Rd] informal conventions/checklist for new predictive modeling packages

Steve Lianoglou mailinglist.honeypot at gmail.com
Thu Jan 5 16:16:54 CET 2012


Good stuff, Max!

Would also be nice to nail your 14 theses to a more permanent wall
than the r-help mailing list ... not sure where that would be, though
... isn't someone supposed to be redesigning the r-project.org
website? [I jest, I jest] More seriously, though, it might be worth
linking to from the developer.r-project.org site as well as from some
blurb in the header of the ML task view.


-steve

On Wed, Jan 4, 2012 at 9:19 AM, Max Kuhn <mxkuhn at gmail.com> wrote:
> Working on the caret package has exposed me to the wide variety of
> approaches that different authors have taken to creating predictive
> modeling functions (aka machine learning) (aka pattern recognition).
>
> I suspect that many package authors are neophyte R users and are
> stumbling through the process of writing their first R package (or R
> code). As such, they may not have been exposed to some of the informal
> conventions that have evolved over time. Also, their package may be
> intended to demonstrate their research and not for "production"
> modeling. In any case, it might be a good idea to print up a few
> points for consideration when creating a predictive modeling package.
> I don't propose changes to existing code.
>
> Some of this is obvious and not limited to this class of modeling
> packages. Many of these points are arguable, so please do so.
>
> If this seems useful, perhaps we could repost the final list to R-Help
> to use as a checklist.
>
> Those of you who have used my code will probably realize that I am not
> a grand architect of R packages =] I'd love to get feedback from those
> of you with a broader perspective and better software engineering
> skills than I (a low bar to step over).
>
> I have marked a few of these items with an OCD tag since I might be
> taking it a bit too far.
>
> The list:
>
> (1) Extend the work of others. There is an amazing amount of unneeded
> redundancy. There are plenty of times that users implement their own
> version of a function because there is a missing feature, but a lot
> of time is spent re-creating duplicate functions. For example, kernlab
> has an excellent set of kernel functions that are really efficient and
> have useful ancillary functions. People may not be aware of these
> functions, but they are one RSiteSearch away. (Perhaps we could
> nominate a few packages like kernlab that implement a specific tool
> well.)
>
> (2) When modeling a categorical outcome, use a factor as input (as
> opposed to 0/1 indicators or integers). Factors are exactly the kind
> of feature that separates R from other languages (I'm looking at you
> SAS) and is a natural structure for this data type.
>
> corollary (2a): save the factor levels in the model object somewhere
>
> corollary (2b): return predicted classes as factors with the same
> levels (and ordering of levels).
>
> (3) Implement a separate prediction function. Some packages only make
> predictions when the model is built, so effectively the model cannot
> be used at any point in the future.
>
> corollary (3a): use object-orientation (eg. predict.{class}) and not
> some made-up function name "modelPredict()" for predicting new
> samples.
>
> (4) If the method only accepts a specific type of input (eg. matrix or
> data frame), please do the conversion whenever appropriate.
>
> (5) Provide both a formula interface (eg. foo(y ~ x, data = dat)) and a
> non-formula interface (eg. foo(x, y)) to the function. Formula methods are
> really inefficient at this time for high-dimensional data but are
> fantastically convenient. There are some good reasons to not use
> formulas, such as functions that do not use a design matrix (eg.
> cforest()) or need factors to be handled in a non-standard way (eg.
> cubist()).
>
> (6) Don't require a test set when model building.
>
> (7) Control all written output during model-building time with a
> verbose option. Resampling can make a mess out of things if
> output/logging is always exposed.
>
> (8) Please use RSiteSearch to avoid name collisions between packages
> (eg. gam(), splsda(), roc(), LogitBoost()). Also search Bioconductor.
>
> (9) Allow the predict function to generate results from many different
> sub-models simultaneously. For example, pls() can return predictions
> across many values of ncomp. enet(), cubist(), blackboost() are other
> examples.
>
> corollary (9a): [OCD] ensure the same object type for predictions.
> There are occasions where predict() will return a vector or a matrix
> depending on the context. I would argue that this is not optimal.
>
> (10) Use a limited vocabulary for options. For example, some predict()
> functions have a "type" option to switch between predicted classes
> and class probabilities. Values of "type" that request class
> probabilities include "prob", "probability", "posterior", "raw",
> "response", etc. I'll make a suggestion of "prob" as a possible
> standard for this situation.
>
> (11) Make sure that class probabilities sum to one. Seriously.
>
> (12) If the model implicitly conducts feature selection, do not
> require unused predictors to be present in future data sets for
> prediction. This may be a problem when the formula interface to models
> is used, but it looks like many functions reference columns by
> position and not name.
>
> (13) Packages that have their own cross-validation functions should
> allow the users to pass in the specific folds/resampling indicators to
> maintain consistency across similar functions in other packages.
>
> (14) [OCD] For binary classification models, model the probability of
> the first level of a factor as the event of interest (again, for
> consistency). Note that glm() does not do this but most others use the
> first level.
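
A minimal sketch of how points (2), (3), (4), (5), (10) and (11) above might fit together, assuming a made-up nearest-centroid classifier. centroid_fit() and everything inside it are hypothetical names for illustration, not functions from any package, and the "probabilities" are only a softmax over negative squared distances, not calibrated estimates:

```r
# Hypothetical toy classifier (nearest class centroid); all names are illustrative.
centroid_fit <- function(x, ...) UseMethod("centroid_fit")

# Non-formula interface: x is a matrix or data frame with column names, y a factor.
centroid_fit.default <- function(x, y, ...) {
  stopifnot(is.factor(y))                    # point (2): factor outcome
  x <- as.matrix(x)                          # point (4): convert input ourselves
  stopifnot(!is.null(colnames(x)))
  lev <- levels(y)
  centers <- do.call(rbind, lapply(lev, function(l) {
    colMeans(x[y == l, , drop = FALSE])
  }))
  rownames(centers) <- lev
  # point (2a): keep the factor levels in the model object
  structure(list(centers = centers, levels = lev), class = "centroid_fit")
}

# Formula interface (point 5): build the design matrix, then reuse the default method.
centroid_fit.formula <- function(formula, data, ...) {
  mf <- model.frame(formula, data)
  x <- model.matrix(attr(mf, "terms"), mf)
  x <- x[, colnames(x) != "(Intercept)", drop = FALSE]
  fit <- centroid_fit.default(x, model.response(mf), ...)
  fit$terms <- terms(mf)
  fit
}

# Points (3) and (3a): a separate S3 predict method.
# Point (10): type = "prob" requests class probabilities.
predict.centroid_fit <- function(object, newdata, type = c("class", "prob"), ...) {
  type <- match.arg(type)
  if (!is.null(object$terms)) {
    tt <- delete.response(object$terms)
    newdata <- model.matrix(tt, model.frame(tt, newdata))
  }
  # Reference predictors by name, not by position.
  newdata <- as.matrix(newdata)[, colnames(object$centers), drop = FALSE]

  # Squared Euclidean distance from every row to every class centroid.
  d2 <- matrix(0, nrow(newdata), length(object$levels),
               dimnames = list(NULL, object$levels))
  for (l in object$levels) {
    d2[, l] <- rowSums(sweep(newdata, 2, object$centers[l, ])^2)
  }

  # Point (11): normalise each row so the class probabilities sum to one.
  w <- exp(-(d2 - apply(d2, 1, min)))
  prob <- w / rowSums(w)
  if (type == "prob") return(prob)

  # Point (2b): predicted classes as a factor with the original levels and order.
  factor(object$levels[max.col(prob, ties.method = "first")],
         levels = object$levels)
}

fit <- centroid_fit(Species ~ ., data = iris)
head(predict(fit, iris, type = "prob"), 3)
table(predict(fit, iris), iris$Species)
```

The same object can be fit through the non-formula route with centroid_fit(iris[, 1:4], iris$Species), and predict() returns the same kind of object either way.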
>
> Thanks,
>
> Max
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
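
For points (7) and (13), something along these lines might be a reasonable shape. This is only a sketch: cv_rmse() is a hypothetical helper built on lm(), and the convention that "folds" is a list of held-out row indices is an assumption made here, not an established standard:

```r
# Hypothetical helper: k-fold cross-validated RMSE for a linear model.
# `folds` is assumed to be a list of held-out row indices, one element per fold.
cv_rmse <- function(formula, data, folds = NULL, k = 5, verbose = FALSE) {
  if (is.null(folds)) {
    # Only generate random folds when the caller did not supply any.
    folds <- split(sample(nrow(data)), rep_len(seq_len(k), nrow(data)))
  }
  y <- model.response(model.frame(formula, data))
  rmse <- vapply(seq_along(folds), function(i) {
    held_out <- folds[[i]]
    fit <- lm(formula, data = data[-held_out, , drop = FALSE])
    pred <- predict(fit, newdata = data[held_out, , drop = FALSE])
    err <- sqrt(mean((y[held_out] - pred)^2))
    # Point (7): nothing is printed unless the caller asks for it.
    if (verbose) message(sprintf("fold %d: RMSE %.3f", i, err))
    err
  }, numeric(1))
  mean(rmse)
}

# Point (13): build the folds once and pass the same ones to every model.
set.seed(1)
shared_folds <- split(sample(nrow(mtcars)), rep_len(1:4, nrow(mtcars)))
rmse_small <- cv_rmse(mpg ~ wt, mtcars, folds = shared_folds)
rmse_large <- cv_rmse(mpg ~ wt + hp, mtcars, folds = shared_folds)
```

Because both calls see identical held-out rows, any difference between rmse_small and rmse_large comes from the models and not from the resampling.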



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


