[R] Playing with formulae

Sat Sep 13 01:12:39 CEST 2003

On Fri, 12 Sep 2003, Ross Boylan wrote:

> First, thanks to everyone for their responses to my programming style
> question.  Second, I have some questions about some obscure corners of
> the language.
>
> Let
> f <- y~x+z
> t <- terms(f).
>
> I want to do some manipulations of the formula that require getting the
> names of variables as character strings (e.g., for indexing into a
> dataset).  However, t, or even attr(t, "variables"), does not provide
> character strings.
>
> 1. Does all.vars(f) reliably produce the same ordering as t?

It doesn't even produce the same set: consider

y~I(x+z)+w
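
A quick sketch of the difference, using only the formula above (the
object names tt and term_vars are just for illustration):

```r
f  <- y ~ I(x + z) + w
tt <- terms(f)

# all.vars() returns every symbol in the formula, in order of appearance.
all.vars(f)
# "y" "x" "z" "w"

# The "variables" attribute is an unevaluated call to list(); drop the
# leading `list` and deparse each element to get character strings.
term_vars <- vapply(as.list(attr(tt, "variables"))[-1], deparse, character(1))
term_vars
# "y" "I(x + z)" "w"
```

So terms() sees I(x + z) as a single variable, while all.vars() sees x
and z separately.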

> 2. Can objects of class name (which I notice appear in places in t) be
> used the same way as character strings (e.g., indexing columns in a data
> set, arguments to match)?  (This would matter if I could pull t apart
> reliably.  I can't.  See 3b for more on that problem.)

No, but as.character will convert them to strings.
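
For instance, with your formula and a made-up data frame (dat is
hypothetical, only there to have something to index into):

```r
f  <- y ~ x + z
tt <- terms(f)

# Element 1 of the "variables" call is `list`, so the response sits at
# position response + 1.
resp <- attr(tt, "variables")[[attr(tt, "response") + 1]]
class(resp)
# "name"

# Convert the name to a string before using it as an index or in match().
resp_name <- as.character(resp)
dat <- data.frame(y = 1:3, x = 4:6, z = 7:9)
dat[[resp_name]]
match(resp_name, names(dat))
```

Note that as.character() is only the right tool when the element is a
plain name; for a call such as log(y), deparse() is the safer choice.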

> 3. t's response attribute is said to be an index of the response
> variable in variables (I presume this means the variables attribute).
>   a) Will all.vars(f)[attr(t, "response")] reliably get me the character
>      string for the name of the response variable?

No.  Consider
Surv(t,s)~x+z
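
A sketch of what goes wrong there.  terms() does not evaluate anything,
so this runs without the survival package loaded:

```r
f  <- Surv(t, s) ~ x + z
tt <- terms(f)

# all.vars() splits the response into its component symbols ...
all.vars(f)
# "t" "s" "x" "z"

# ... so indexing it by the response attribute picks up only "t".
wrong <- all.vars(f)[attr(tt, "response")]

# The "variables" attribute keeps the response as one call.
right <- deparse(attr(tt, "variables")[[attr(tt, "response") + 1]])
right
# "Surv(t, s)"
```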

>   b) How can I get the response variable out of the "variables"
>      attribute?  In my example,
>      response is 1, but attr(t, "variables")[1] is list().
>      Possible answer: attr(t, "variables")[[response+1]] looks right,
>      and is of class name.  Hence the interest in question 2.

The "factors" attribute has row names corresponding to variables and
column names corresponding to terms.

> 4. Is the actual number of coefficients the model will need
> length(attr(t,"term.labels"))+attr(t, "intercept"),
> regardless of interactions or I() terms?

No.  A single term can create several columns of the design matrix,
e.g., factors, polynomials, splines.  You won't know how many until you
call model.matrix.
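
A toy illustration with a three-level factor and a quadratic
polynomial (the data frame dat and its columns are made up):

```r
dat <- data.frame(
  y = c(1, 3, 2, 5, 4, 6),
  g = factor(rep(c("a", "b", "c"), 2)),
  x = 1:6
)
f  <- y ~ g + poly(x, 2)
tt <- terms(f)

# Count suggested by the terms object: two terms plus the intercept.
n_guess <- length(attr(tt, "term.labels")) + attr(tt, "intercept")
n_guess
# 3

# Actual number of columns: intercept, two contrasts for g, and two
# polynomial columns.
n_cols <- ncol(model.matrix(f, dat))
n_cols
# 5
```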

> 5. The documentation for terms.formula appears to imply that if there is
> a simple formula without interactions I will get coefficient estimates
> in the same order that the original formula specified textually.  Right?
> I'm concerned about this because I'm having a vector of simulation
> coefficients passed in along with the formula, and I need to be sure
> they line up with the model terms.

Yes.

It might be useful to have names on the coefficients, though.  Then you
could match on the names and not worry about the order.
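
Something along these lines, as a sketch only.  Here dat and beta are
hypothetical stand-ins for your data and your simulation coefficients,
and beta is deliberately given in the wrong order:

```r
dat <- data.frame(
  y = c(1, 3, 2, 5, 4, 6),
  x = 1:6,
  z = c(2, 1, 4, 3, 6, 5)
)
f <- y ~ x + z
X <- model.matrix(f, dat)

# Named coefficients, supplied in arbitrary order.
beta <- c(z = 0.5, "(Intercept)" = 1, x = 2)

# Reorder by design-matrix column name; a missing name shows up as NA.
beta <- beta[colnames(X)]
stopifnot(!anyNA(beta))

# Linear predictor with coefficients guaranteed to line up.
eta <- drop(X %*% beta)
```

This also fails loudly if a coefficient is missing, rather than
silently pairing it with the wrong column.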

An example of the sort of thing you're trying to do is in
untangle.specials() in the survival package, which is used to locate terms
and variables for strata() and cluster() in coxph().  It uses the dimnames
of the "factors" attribute as keys.
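
The idea can be sketched in base R with the specials argument of
terms().  This is not the survival package's code, just an
illustration of the same lookup; the variable names time, status, age
and sex are made up:

```r
f  <- Surv(time, status) ~ age + strata(sex)
tt <- terms(f, specials = "strata")

# Position of each strata() call among the variables, i.e. a row index
# into the "factors" matrix.
idx <- attr(tt, "specials")$strata
fac <- attr(tt, "factors")

# Variable names are the row names ...
strata_vars <- rownames(fac)[idx]

# ... and the terms involving them are the columns with a nonzero entry
# in those rows.
strata_terms <- colnames(fac)[colSums(fac[idx, , drop = FALSE] > 0) > 0]

strata_vars
# "strata(sex)"
strata_terms
# "strata(sex)"
```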

	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle