[Rd] Proposal: model.data

Paul Johnson pauljohn32 at gmail.com
Thu May 3 17:51:51 CEST 2012


Greetings:

I'm still working on functions to make it easier for students to
interpret regression predictions.  I am working out a scheme to
more easily create "newdata" objects (for use in predict functions).
This has been done before in packages like Zelig, Effects,
rms, and others. Nothing is "exactly right," though. Stata users keep
flashing their "predicted probability tables" at me, and I'd like to
have something like that to use with R.

I'm proposing here a function model.data that takes a fitted regression
model and creates a data frame of raw data from which newdata objects
can be constructed. This follows a suggestion that Bill Dunlap made
to me in response to a question I posted in r-help.

While studying termplot code, I saw the "carrier" function approach
to deducing the "raw" predictors.  However, it does not always work.

Here is one problem: termplot mistakes the 10 in log(10 + x1) for a variable.

Example:

dat <- data.frame(x1 = rnorm(100), x2 = rpois(100, lambda=7))
STDE <- 10
dat$y <- 1.2 * log(10 + dat$x1) + 2.3 * dat$x2 + rnorm(100, sd = STDE)

m1 <- lm( y ~ log(10 + x1)  + x2, data=dat)
termplot(m1)

## See the trouble? termplot thinks 10 is the term to plot.
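
If I read the termplot source correctly, the culprit is the recursive
"carrier" name logic, which keeps descending into the first argument of
each call and so ends up at the constant. A simplified paraphrase of the
idea (a sketch from memory, not the exact stats code; carrier.name.sketch
is just my own illustrative name):

carrier.name.sketch <- function(term) {
    ## descend into the first argument of each call and report
    ## whatever is found at the bottom
    if (length(term) > 1L) carrier.name.sketch(term[[2L]])
    else deparse(term)
}

carrier.name.sketch(quote(log(10 + x1)))
## [1] "10"    <- the constant, not "x1"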

Another problem is that predict(type = "terms") sometimes does not
behave sensibly. Right-hand-side terms that are nonlinear
transformations of the same variable are treated as separate terms.

## Example:
dat$y2 <- 1.2 * log(10 + dat$x1) + 2.3 * dat$x1^2 + rnorm(100, sd = STDE)

m2 <- lm( y2 ~ log(10 + x1)  + sin(x1), data=dat)
summary(m2)

predict(m2, type="terms")

## Output:
## log(10 + x1)     sin(x1)
## 1     1.50051781 -2.04871711
## 2    -0.14707391  0.31131124

What I wish would happen instead is one "correct" prediction for
each value of x1, as in the following output:

predict(m2, newdata = data.frame(x1 = dat$x1))

## > predict(m2, newdata = data.frame(x1 = dat$x1))
##       1        2        3        4        5        6        7        8
## 17.78563 18.49806 17.50719 19.70093 17.45071 19.69718 18.84137 18.89971
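
To be fair, the type = "terms" numbers are not wrong, they are just a
per-term decomposition: the term columns plus the "constant" attribute
add back up to the ordinary predictions. A quick sanity check (my own,
not something the predict method prints):

tm <- predict(m2, type = "terms")
all.equal(unname(rowSums(tm) + attr(tm, "constant")),
          unname(predict(m2)))
## should be TRUE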

The "fix" I'm testing now is the following new function, "model.data".
which tries to re-create the data object that would be
consistent with a fitted model. This follows a suggestion from
Bill Dunlap in r-help on 2012-04-22



##' Creates a "raw" (UNTRANSFORMED) data frame equivalent
##' to the input data that would be required to fit the given model.
##'
##' Unlike model.frame and model.matrix, this does not return transformed
##' variables.
##'
##' @param model A fitted regression model in which the data argument
##' is specified. This function will fail if the model was not fit
##' with the data option.
##' @return A data frame
##' @export
##' @author Paul E. Johnson <pauljohn@@ku.edu>
##' @example inst/examples/model.data-ex.R
model.data <- function(model){
    fmla <- formula(model)
    allnames <- all.vars(fmla) ## all variable names
    ## indep variables, includes d in poly(x,d)
    ivnames <- all.vars(formula(delete.response(terms(model))))
    ## datOrig: the original data frame supplied to the model
    datOrig <- eval(model$call$data, environment(formula(model)))
    if (is.null(datOrig)) stop("model.data: input model has no data frame")
    ## dat: almost right, but still includes d in poly(x, d)
    dat <- get_all_vars(fmla, datOrig)
    ## Drop "d" and other "non-variable" variable names that are not
    ## columns of datOrig:
    keepnames <- intersect(names(dat), names(datOrig))
    ## Keep only rows actually used in the model fit, and only the kept columns
    dat <- dat[row.names(model$model), keepnames]
    ## keep ivnames that exist in datOrig
    attr(dat, "ivnames") <- intersect(ivnames, names(datOrig))
    invisible(dat)
}
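
One note on the eval() line: I evaluate model$call$data in
environment(formula(model)) rather than in the caller's frame, so the
data frame should still be found even when the model was fit inside
another function. An untested sketch, with made-up names:

fitInside <- function(){
    localdat <- data.frame(x1 = rnorm(10))
    localdat$y <- localdat$x1 + rnorm(10)
    lm(y ~ x1, data = localdat)
}
mLocal <- fitInside()
head(model.data(mLocal))  ## finds localdat via the formula's environment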


This works for the test cases like log(10 + x1) and so forth:

## Examples:

head(m1.data <- model.data(m1))

head(m2.data <- model.data(m2))

## > head(m1.data <- model.data(m1))
##          y          x1 x2
## 1 18.53846  0.46176539  8
## 2 28.24759  0.09720934  7
## 3 23.88184  0.67602556  9
## 4 23.50130 -0.74877054  8
## 5 25.81714  1.02555255  5
## 6 24.75052 -0.69659539  6
## > head(m2.data <- model.data(m2))
##          y          x1
## 1 18.53846  0.46176539
## 2 28.24759  0.09720934
## 3 23.88184  0.67602556
## 4 23.50130 -0.74877054
## 5 25.81714  1.02555255
## 6 24.75052 -0.69659539


d <- 2
m4 <- lm(y ~ poly(x1,d), data=dat)

head(m4.data <- model.data(m4))

##          y          x1
## 1 18.53846  0.46176539
## 2 28.24759  0.09720934
## 3 23.88184  0.67602556

Another strength of this approach is that the return object has an
attribute "ivnames".  If R's termplot used model.data instead of the
carrier functions, this would make for a much tighter set of code.
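
For example, a newdata builder on top of model.data might look roughly
like this (a rough sketch only: newdata.sketch is a hypothetical name,
and holding the other predictors at their means is just one possible
choice, which would not make sense for factors):

newdata.sketch <- function(model, focal, n = 5){
    dat <- model.data(model)
    ivs <- attr(dat, "ivnames")
    ## hold every predictor at its mean ...
    nd <- lapply(dat[ , ivs, drop = FALSE], mean)
    ## ... except the focal one, which sweeps over its observed range
    nd[[focal]] <- seq(min(dat[[focal]]), max(dat[[focal]]), length.out = n)
    data.frame(nd)
}

predict(m1, newdata = newdata.sketch(m1, "x1"))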

What flaws do you see in this?

One flaw is that I did not know how to re-construct the data from the
parent environment, so I insist that the regression model be fit with a
data argument. Is this restriction necessary, or can one of the R
experts help?
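
To illustrate the restriction (variable names here are just for the
example):

x3 <- rnorm(100)
y3 <- 1.2 * x3 + rnorm(100)
m5 <- lm(y3 ~ x3)   ## no data argument
## model.data(m5) stops with "model.data: input model has no data
## frame", because m5$call$data is NULL.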

Another possible flaw: I keep only the columns from the original data
frame that are needed to re-construct the model.frame, and to match the
rows used in the fit I rely on the row.names of the model.frame.

Are there other formulae that will confuse this scheme?

If somebody in R Core would like this and would consider putting it, or
something like it, into base R, then many chores involving predicted
values would become much easier.

PJ
-- 
Paul E. Johnson
Professor, Political Science    Assoc. Director
1541 Lilac Lane, Room 504     Center for Research Methods
University of Kansas               University of Kansas
http://pj.freefaculty.org            http://quant.ku.edu


