[Rd] Advice on package design for handling of dots in a formula

Wed Oct 15 16:11:14 CEST 2014

I am working on a new package, one in which the user needs to specify the
role that different variables play in the analysis. Where I'm stumped is the
best way to have users specify those roles.

Approach #1: Separate formula for each special component

First I thought to have users specify each formula separately, like:

new.function(formula=y~X1+X2+X3,
             weights=~w,
             observationID=~ID,
             strata=~site,
             data=mydata)

This seems to be a common approach in other packages. However, one of my
testers noted that if he put formula=y~. then w, ID, and site showed up in
the model where they weren't supposed to be. I could add some code to try to
prevent that (string matching and editing the terms object, perhaps?), but
that seemed a little clumsy to me.
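
A minimal sketch of one way to patch Approach #1 without string matching: drop the role variables from the data before asking terms() to expand the dot. The helper name expand_dot and the toy data frame are hypothetical, made up for illustration. It relies only on the documented behaviour that terms() uses its data argument to interpret a dot.

```r
# Hypothetical helper: expand the dot against the columns of `data`
# that are not named in any of the role formulas passed via `...`.
expand_dot <- function(formula, data, ...) {
  # Collect the variable names used in the role formulas, e.g. ~w gives "w"
  role_vars <- unique(unlist(lapply(list(...), all.vars)))
  keep <- setdiff(names(data), role_vars)
  # terms() only needs the column names of `data` to expand the dot
  tt <- terms(formula, data = data[, keep, drop = FALSE])
  formula(tt)
}

# Toy data, for illustration only
mydata <- data.frame(y = rnorm(6), X1 = rnorm(6), X2 = rnorm(6), X3 = rnorm(6),
                     w = runif(6), ID = 1:6, site = gl(2, 3))

f1 <- expand_dot(y ~ ., mydata,
                 weights = ~w, observationID = ~ID, strata = ~site)
print(f1)
```

A formula without a dot passes through unchanged, so the same call path would serve both kinds of input.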

Approach #2: Create specials to label special variables

So I turned to the user interface design in coxph where the user can specify
strata and cluster in a single formula. So my approach would look something
like:

new.function(formula=y~weights(w)+strata(site)+observationID(ID)+X1+X2+X3,
             data=mydata)

My aim would be that the user could use a dot instead of X1+X2+X3 and the
dot would not expand to include w, site, and ID. However, at least as
implemented in coxph(), this approach does not handle the dot in the formula
any better than the first approach.

Call:
coxph(formula = Surv(time, status) ~ strata(sex) + ., data = test1)

     coef exp(coef) se(coef)     z    p
x   0.802      2.23    0.822 0.976 0.33
sex    NA        NA    0.000    NA   NA

Surely the user wants the dot to mean all the other variables but not the
ones that are already in the model, like sex. I could also develop some code
(again perhaps clumsily) to search after the fact for variables (like sex)
that shouldn't be in there.
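
A rough sketch of how that after-the-fact search might look for Approach #2, using the specials attribute of the terms object rather than string matching. The function name strip_role_vars and the toy data are hypothetical, and this is not how coxph() itself works. It only handles main-effect terms; interactions involving a role variable would need extra care.

```r
# Hypothetical helper: expand the dot, then drop any bare term whose
# variable already appears inside one of the special wrappers.
strip_role_vars <- function(formula, data,
                            specials = c("weights", "strata", "observationID")) {
  tt <- terms(formula, specials = specials, data = data)
  vars <- as.list(attr(tt, "variables"))[-1]   # drop the leading `list`
  sp_idx <- unlist(attr(tt, "specials"))       # positions of the special calls
  role_vars <- unique(unlist(lapply(vars[sp_idx], all.vars)))
  labels <- attr(tt, "term.labels")
  keep <- labels[!labels %in% role_vars]
  reformulate(keep, response = formula[[2]])
}

# Toy data, for illustration only
mydata <- data.frame(y = rnorm(6), X1 = rnorm(6), X2 = rnorm(6), X3 = rnorm(6),
                     w = runif(6), ID = 1:6, site = gl(2, 3))

f2 <- strip_role_vars(y ~ weights(w) + strata(site) + observationID(ID) + .,
                      data = mydata)
print(f2)
```

The special calls themselves are kept in the result, so downstream code can still locate them with terms(f2, specials = ...).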

Approach #3: Require the user to first describe a separate study design object

Lastly I looked at the design for the survey package. This package first
requires the user to create an object that describes the key components of
the dataset. So I would have the user do something like this:

mystudy <- study.design(weights=~w,
                        observationID=~ID,
                        strata=~site,
                        data=mydata)
myresults <- doanalysis(formula=y~X1+X2+X3, design=mystudy)

But it seems that the survey package is also not designed to handle the dot.

library(survey)
data(api)
dstrat <- svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
svyglm(api00~., design=dstrat)
Error in svyglm.survey.design(api00 ~ ., design = dstrat) : 
  all variables must be in design= argument
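
For comparison, a design object in my own package could record the role variables up front, which would let the analysis function expand the dot over the remaining columns only. The following is a minimal sketch under that assumption; study.design and doanalysis are the hypothetical functions from the example above, not the survey package's implementation, and the toy data are made up.

```r
# Hypothetical constructor: records which columns play a design role.
study.design <- function(weights = NULL, observationID = NULL,
                         strata = NULL, data) {
  roles <- list(weights = weights, observationID = observationID,
                strata = strata)
  # Keep only the roles that were supplied, stored as variable names
  roles <- lapply(roles[!vapply(roles, is.null, logical(1))], all.vars)
  structure(list(data = data, roles = roles), class = "study.design")
}

# Hypothetical front end: expands the dot over non-role columns only.
doanalysis <- function(formula, design) {
  covariates <- setdiff(names(design$data), unlist(design$roles))
  tt <- terms(formula, data = design$data[, covariates, drop = FALSE])
  formula(tt)  # a real implementation would go on to fit the model
}

# Toy data, for illustration only
mydata <- data.frame(y = rnorm(6), X1 = rnorm(6), X2 = rnorm(6), X3 = rnorm(6),
                     w = runif(6), ID = 1:6, site = gl(2, 3))

mystudy <- study.design(weights = ~w, observationID = ~ID,
                        strata = ~site, data = mydata)
f3 <- doanalysis(y ~ ., design = mystudy)
print(f3)
```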

Does anyone have advice on how best to handle this? 
1. Tell my tester "Tough, you can't use dots in a formula in my package",
which is essentially what the survey package seems to do. Encourage the use
of survey::make.formula()?
2. Fix Approach #1 to search for duplicates in the weights, observation ID,
and strata parameters. Any elegant ways to do that?
3. Fix Approach #2, the coxph style, to try to remove redundant covariates.
Not sure if there's a graceful way that doesn't involve string matching.
4. Any existing elegant approaches to interpreting the dot? Or should I just
do string matching to delete duplicate variables from the terms object?

Thanks,
Greg

Greg Ridgeway
Associate Professor
University of Pennsylvania