formula {stats} | R Documentation |
The generic function formula and its specific methods provide a way of extracting formulae which have been included in other objects.
as.formula is almost identical, additionally preserving attributes when object already inherits from "formula".
formula(x, ...)
DF2formula(x, env = parent.frame())
as.formula(object, env = parent.frame())
## S3 method for class 'formula'
print(x, showEnv = !identical(e, .GlobalEnv), ...)
x , object |
R object, for |
... |
further arguments passed to or from other methods. |
env |
the environment to associate with the result, if not already a formula. |
showEnv |
logical indicating if the environment should be printed as well. |
The models fitted by, e.g., the lm and glm functions are specified in a compact symbolic form.
The ~ operator is basic in the formation of such models.
An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model.
Such a model consists of a series of terms separated by + operators.
The terms themselves consist of variable and factor names separated by : operators.
Such a term is interpreted as the interaction of all the variables and factors appearing in the term.
In addition to + and :, a number of other operators are useful in model formulae.
The * operator denotes factor crossing: a*b is interpreted as a + b + a:b.
The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c) which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions.
The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b.
The / operator provides a shorthand, so that a / b is equivalent to a + b %in% a.
The - operator removes the specified terms, hence (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c.
It can also be used to remove the intercept term: when fitting a linear model, y ~ x - 1 specifies a line through the origin.
A model with no intercept can also be specified as y ~ x + 0 or y ~ 0 + x.
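The operator expansions described above can be inspected with terms(), which parses a formula without needing any data. The following is a minimal sketch; the helper name term_labels is hypothetical and is introduced only for this illustration.

```r
# Hypothetical helper: the term labels that terms() derives from a formula.
term_labels <- function(f) attr(terms(f), "term.labels")

term_labels(y ~ a * b)                # crossing: a, b and a:b
term_labels(y ~ (a + b + c)^2)        # main effects plus all two-way interactions
term_labels(y ~ a / b)                # nesting shorthand: a and a:b
term_labels(y ~ (a + b + c)^2 - a:b)  # same as ^2, with a:b removed

# The "intercept" attribute is 1 by default and 0 once the intercept is removed.
attr(terms(y ~ x), "intercept")
attr(terms(y ~ x - 1), "intercept")
attr(terms(y ~ 0 + x), "intercept")
```

Because terms() never evaluates the variables, none of y, a, b, c or x needs to exist for this to run.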
While formulae usually involve just variable and factor names, they can also involve arithmetic expressions.
The formula log(y) ~ a + log(x) is quite legal.
When such arithmetic expressions involve operators which are also used symbolically in model formulae, there can be confusion between arithmetic and symbolic operator use.
To avoid this confusion, the function I() can be used to bracket those portions of a model formula where the operators are used in their arithmetic sense. For example, in the formula y ~ a + I(b+c), the term b+c is to be interpreted as the sum of b and c.
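The effect of I() can be seen by comparing term labels with and without it. This is a small sketch using terms() only; the variable names are placeholders and no data are needed.

```r
# Without I(), + is symbolic: b and c are two separate terms.
labels_plain <- attr(terms(y ~ a + b + c), "term.labels")

# With I(), b + c is arithmetic: a single term holding the sum of b and c.
labels_sum <- attr(terms(y ~ a + I(b + c)), "term.labels")

# The same applies to ^: I(x^2) is the square of x, not a crossing.
labels_sq <- attr(terms(y ~ x + I(x^2)), "term.labels")

labels_plain
labels_sum
labels_sq
```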
Variable names can be quoted by backticks `like this` in formulae, although there is no guarantee that all code using formulae will accept such non-syntactic names.
Most model-fitting functions accept formulae whose right-hand side includes the function offset to indicate terms with a fixed coefficient of one. Some functions accept other ‘specials’ such as strata or cluster (see the specials argument of terms.formula).
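As an illustration of an offset term, the sketch below fits two linear models to the built-in cars data set. The choice of offset(speed) is artificial and made only so that the result is easy to check: fixing a coefficient of one on speed should lower the freely estimated slope by exactly one.

```r
# Ordinary fit: the slope for speed is estimated freely.
fit_free <- lm(dist ~ speed, data = cars)

# Same model plus an offset: speed also enters with a fixed coefficient of one.
fit_off <- lm(dist ~ speed + offset(speed), data = cars)

coef(fit_free)["speed"]
coef(fit_off)["speed"]

# The offset is not listed among the estimated terms.
attr(terms(fit_off), "term.labels")
```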
There are two special interpretations of . in a formula. The usual one is in the context of a data argument of model fitting functions and means ‘all columns not otherwise in the formula’: see terms.formula. In the context of update.formula, only, it means ‘what was previously in this part of the formula’.
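Both meanings of . can be tried out directly. The data frame d below is made up for this sketch.

```r
d <- data.frame(y = 1:4, a = c(2, 1, 4, 3), b = c(0, 1, 0, 1))

# With a data argument, . stands for all columns not otherwise in the formula.
dot_labels <- attr(terms(y ~ ., data = d), "term.labels")
dot_labels

# In update(), . stands for the previous content of that side of the formula.
f_added <- update(y ~ a, . ~ . + b)   # keep both sides, add b on the right
f_logged <- update(y ~ a, log(.) ~ .) # wrap the old response in log()
f_added
f_logged
```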
When formula is called on a fitted model object, either a specific method is used (such as that for class "nls") or the default method. The default first looks for a "formula" component of the object (and evaluates it), then a "terms" component, then a formula parameter of the call (and evaluates its value) and finally a "formula" attribute.
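For example, the formula can be recovered from a fitted lm object, including when the model was fitted from a formula stored in a variable. This sketch uses the built-in cars and mtcars data sets; the variable names are arbitrary.

```r
# Formula given literally in the call.
fit1 <- lm(dist ~ speed, data = cars)
formula(fit1)

# Formula passed via a variable: the call only contains the name fo,
# but formula() still returns the actual model formula.
fo <- mpg ~ wt
fit2 <- lm(fo, data = mtcars)
formula(fit2)
```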
There is a formula method for data frames. When there is a "terms" attribute with a formula, e.g., for a model.frame(), that formula is returned. If you'd like the previous (R <= 3.5.x) behavior, use the auxiliary DF2formula(), which does not consider a "terms" attribute.
Otherwise, if there is only one column this forms the RHS with an empty LHS. For more columns, the first column is the LHS of the formula and the remaining columns separated by + form the RHS.
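The data frame method can be sketched as follows. The data frame d is invented for illustration, and DF2formula() is assumed to be available (per the text above, this requires an R version later than 3.5.x).

```r
d <- data.frame(y = 1:3, a = c(2.5, 1, 4), b = c(0, 1, 0))

f_all <- formula(d)        # first column is the LHS, the rest form the RHS
f_one <- formula(d["a"])   # a single column gives a one-sided formula
f_all
f_one

# A model frame carries a "terms" attribute, so its formula is returned as is.
mf <- model.frame(log(y) ~ a, data = d)
f_mf <- formula(mf)
f_mf

# DF2formula() ignores "terms" and builds the formula from the column names.
DF2formula(mf)
```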
All the functions above produce an object of class "formula" which contains a symbolic model formula.
A formula object has an associated environment, and this environment (rather than the parent environment) is used by model.frame to evaluate variables that are not found in the supplied data argument.
Formulas created with the ~ operator use the environment in which they were created. Formulas created with as.formula will use the env argument for their environment.
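The role of the formula environment can be sketched with a formula created inside local(). The variable w exists only in that local environment, not in the data frame, yet model.frame can still find it. All names here are made up for the illustration.

```r
# The formula is created inside local(), so its environment is the local one.
f <- local({
  w <- c(10, 20, 30)
  y ~ w
})

d <- data.frame(y = 1:3)          # contains y but not w
mf <- model.frame(f, data = d)    # w is looked up in environment(f)
mf$w

# as.formula() attaches the environment given by env to a formula built from a string.
e <- new.env()
g <- as.formula("y ~ w", env = e)
identical(environment(g), e)
```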
In R versions up to 3.6.0, a character x of length more than one was parsed as separate lines of R code and the first complete expression was evaluated into a formula when possible. This silently truncated such character vectors, inefficiently and to some extent inconsistently, as this behaviour had been undocumented. For this reason, such use has been deprecated. If you must work via character x, do use a string, i.e., a character vector of length one.
E.g., eval(call("~", quote(foo + bar))) has been an order of magnitude more efficient than formula(c("~", "foo + bar")).
Further, character “expressions” needing an eval() to return a formula are now deprecated.
Chambers, J. M. and Hastie, T. J. (1992) Statistical models. Chapter 2 of Statistical Models in S, eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.
For formula manipulation: update.formula, terms.formula, and all.vars.
For typical use: lm, glm, and coplot.
For formula construction: reformulate.
class(fo <- y ~ x1*x2) # "formula"
fo
typeof(fo) # R internal : "language"
terms(fo)
environment(fo)
environment(as.formula("y ~ x"))
environment(as.formula("y ~ x", env = new.env()))
## Create a formula for a model with a large number of variables:
xnam <- paste0("x", 1:25)
(fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+"))))
## Equivalent with reformulate():
fmla2 <- reformulate(xnam, response = "y")
stopifnot(identical(fmla, fmla2))