[R] pulling items out of a lm() call

Peter Dalgaard p.dalgaard at biostat.ku.dk
Mon May 1 13:37:21 CEST 2006


Andrew Gelman <gelman at stat.columbia.edu> writes:

> I want to write a function to standardize regression predictors, which 
> will require me to do some character-string manipulation to parse the 
> variables in a call to lm() or glm().
> 
> For example, consider the call
> lm (y ~ female + I(age^2) + female:black + (age + education)*female).
> 
> I want to be able to parse this to pick out the input variables 
> ("female", "age", "black", "education").  Then I can transform these as 
> appropriate (to get "z.female", "z.age", etc), feed them back into the 
> lm() function, and go from there.
> 
> Does anyone know an easy way to pull out the variables?  I basically 
> have to parse out the symbols "+", ":", "*", and " ", but there's also 
> the problem of handling parentheses and the I() operator.

At which level of generality do you want this?

Consider
> attr(terms(y ~ female + I(age^2) + female:black + (age +
+               education)*female),"variables")
list(y, female, I(age^2), black, age, education)

> attr(delete.response(terms(y ~ female + I(age^2) + female:black +
+          (age + education)*female)),"variables")
list(female, I(age^2), black, age, education)

This gets you some of the way. However, there are complications: you
can't just drop composite terms like "I(age^2)", because "age" is not
guaranteed to appear among the other terms:

> attr(terms( ~ I(speed^2)),"variables")
list(I(speed^2))

So you need some way to tease out the individual variables inside I().
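
To see what such a walk has to deal with, it helps to look at how R
stores an expression like I(speed^2): a call behaves like a list whose
first element is the function name and whose remaining elements are the
arguments. A minimal illustration (the names e and inner are made up
for this sketch):

```r
e <- quote(I(speed^2))

e[[1]]                # the function name: the symbol I
inner <- e[[2]]       # the single argument, itself a call: speed^2

as.list(inner)        # three elements: `^`, speed, and the constant 2
is.call(inner)        # TRUE: there is more structure to descend into
is.name(inner[[2]])   # TRUE: speed is a plain symbol, i.e. a variable
```

So the variables sit at the leaves, and everything in first position
of a call is a function name rather than data.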

Here's a first cut.

l <- attr(delete.response(terms(y ~ female + I(age^2) + female:black
             + (age + education)*female)),"variables")

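getterms <- function(e) {
    # keep symbols; for calls, drop the function name (first element)
    # and recurse into the arguments; constants fall through as NULL
    if (is.name(e)) e
    else if (is.call(e)) lapply(e[-1], getterms)
}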

unique(c(lapply(l[-1],getterms), recursive=TRUE)) 

and possibly throw in an as.character() to get a vector of strings
rather than a list of symbols. Notice that since anything can go
inside I(), you can get into trouble if parts of the expression are
not intended as variables (e.g., y^lambda where lambda is a scalar).
The getterms function above pragmatically assumes that at least
function names need to be discarded.
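
Put together as one runnable piece, the steps above might look like the
sketch below. The names f, syms, vars, l2 and vars2 are introduced only
for illustration, and the comparison with base R's all.vars() is an
added cross-check, not part of the first cut itself. The last part
reproduces the caveat: x and lambda are hypothetical names, and lambda
is reported as a variable even if it was meant as a scalar.

```r
# Formula from the original question
f <- y ~ female + I(age^2) + female:black + (age + education)*female

# Unevaluated call: list(female, I(age^2), black, age, education)
l <- attr(delete.response(terms(f)), "variables")

getterms <- function(e) {
    if (is.name(e)) e
    else if (is.call(e)) lapply(e[-1], getterms)
}

# Flatten the nested result, drop duplicates, then turn each
# symbol into a string
syms <- unique(c(lapply(l[-1], getterms), recursive = TRUE))
vars <- vapply(syms, as.character, character(1))
print(vars)

# Cross-check: all.vars() lists the names used in a formula;
# remove the response to compare with the predictors found above
print(setdiff(all.vars(f), "y"))

# The caveat: a constant named inside I() is picked up too
l2 <- attr(terms(~ I(x^lambda)), "variables")
vars2 <- vapply(unique(c(lapply(l2[-1], getterms), recursive = TRUE)),
                as.character, character(1))
print(vars2)
```

Whether a name like lambda should be treated as a predictor cannot be
decided from the formula alone; one option is to keep only the names
that actually occur as columns of the data frame being modelled.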

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907



