[R-pkg-devel] Determine subset from glm object

Mon Jul 9 23:45:07 CEST 2018

On second thoughts it may be better to preserve the original data and na.action in the call to glm. So then you might combine the idea of a dummy model frame with evaluating the subset, e.g.

mfcall <- call("model.frame", reformulate(all.vars(f)), data = data)
mf <- eval(mfcall, parent.frame())
mf$id <- seq_len(nrow(mf))
subset <- mf$id %in% model.frame( ~ id, data = mf, subset = subset)$id

This will give the subset as a logical vector whether it was originally supplied as logical, numeric or character. Then you might combine this with the logical vector based on the first glm as follows:

subset[subset] <- linearity

On Mon, Jul 9, 2018, at 10:14 PM, Heather Turner wrote:
> Good point. In that case a solution might be to create a model frame 
> based on the named variables, e.g.
> 
> #  general formula
> f <- ~ log(x) + ns(v, df = 2)
> # model frame based on "bare" variables; deal with user-supplied subset, 
> data, na.action, etc
> mfcall <- call("model.frame", reformulate(all.vars(f)), subset = subset, 
> data = data, na.action = na.action)
> mf <- eval(mfcall, parent.frame())
> 
> Then `mf` can be passed as the data argument to `glm` without any subset 
> argument for the first model and with the new subset argument for the 
> second model.
> 
> 
> On Mon, Jul 9, 2018, at 5:06 PM, Ben Bolker wrote:
> > 
> >   From painful experience: model.frame() does *NOT* necessarily return a
> > data frame that can be successfully used as the data= argument for models.
> > 
> >   - transformed variables (e.g. log(x)) will be in the model frame
> > rather than the original variables,  so when model.frame() is called
> > again within glm(), it won't find the original variables
> >   - variables with data-dependent bases (poly(), ns(), etc.) get
> > computed and stuck in the model frame - again, the original variables
> > are inaccessible
> > 
> > 
> > On 2018-07-09 11:20 AM, Heather Turner wrote:
> > > 
> > > 
> > > On Sun, Jul 8, 2018, at 8:25 PM, Charles Geyer wrote:
> > >> I spoke too soon.  The problem isn't that I don't know how to get the
> > >> subset argument. I am just calling glm (via eval) with (mostly) the
> > >> same arguments as the call to my function, so subset is (if not
> > >> missing) an argument to my function too.  So I can just use it.
> > >>
> > >> The problem is that I then want to call glm again fitting a subset of
> > >> the original subset (if there was one).  And when I do that glm will
> > >> refer to the original data wherever it is, and I don't have that.
> > >>
> > >> if this isn't clear, here is the code as it stands now
> > >> https://github.com/cjgeyer/glmdr/blob/master/package/glmdr/R/glmdr.R.
> > >>
> > >> The issue is with the lines (very near the end)
> > >>
> > >> subset.lcm <- as.integer(rownames(modmat))
> > >> subset.lcm <- subset.lcm[linearity]
> > >> # call glm again
> > >> call.glm$subset <- subset.lcm
> > >> gout.lcm <- eval(call.glm, parent.frame())
> > >>
> > >> I can see from what Duncan said that I really don't want the
> > >> as.integer around rownames.  But it is not clear what would be better.
> > >>
> > >> I just had another thought that I could get the original data with
> > >> another call to glm with subset removed from the call and method =
> > >> "model.frame" added.  And I think (maybe, have to try it) that it
> > >> would have NA's removed or whatever na.action says to do.
> > >> But that seems redundant.
> > >>
> > >>
> > > As you are calling stats::glm, you can use `model.frame` to get the data used to fit the model after applying subset and na.action. So then you can do:
> > > 
> > > call.glm$subset <- linearity
> > > call.glm$data <- model.frame(gout)
> > > 
> > > I think this is what you are after?
> > > 
> > > Heather
> > > 
> > >>
> > >> On Sun, Jul 8, 2018, 1:04 PM Charles Geyer <charlie using stat.umn.edu> wrote:
> > >>>
> > >>> I think your second option sounds better because this is all happening inside one function I'm writing so users won't be able mess with the glm object. Many thanks.
> > >>>
> > >>> On Sun, Jul 8, 2018, 12:10 PM Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
> > >>>>
> > >>>> On 08/07/2018 11:48 AM, Charles Geyer wrote:
> > >>>>> I need to find out from an object returned by R function glm with argument
> > >>>>> x = TRUE
> > >>>>> what the subsetting was.  It appears that if gout is that object, then
> > >>>>>
> > >>>>> as.integer(rownames(gout$x))
> > >>>>>
> > >>>>> is a subset vector equivalent to the one actually used.
> > >>>>
> > >>>> You don't want the "as.integer".  If the dataframe had rownames to start
> > >>>> with, the x component of the fit will have row labels consisting of
> > >>>> those labels, so as.integer may fail.  Even if it doesn't, the rownames
> > >>>> aren't necessarily sequential integers.   You can index the dataframe by
> > >>>> the character versions of the default numbers, so simply
> > >>>> rownames(gout$x) should always work.
> > >>>>
> > >>>> More generally, I'm not sure your question is well posed.  What do you
> > >>>> mean by "the subsetting"?  If you have something like
> > >>>>
> > >>>> df <- data.frame(letters, x = 1:26, y = rbinom(26, 1, 0.5))
> > >>>>
> > >>>> df1 <- subset(df, letters > "b" & letters < "y")
> > >>>>
> > >>>> gout <- glm(y ~ x, data = df1, subset = letters < "q", x = TRUE)
> > >>>>
> > >>>> the rownames(gout$x) are going to be numbers for rows of df, because df1
> > >>>> will get a subset of those as row labels.
> > >>>>
> > >>>>
> > >>>>> I do also have the call to glm (as a call object) so can determine the
> > >>>>> actual subset argument, but this seems to be not so useful because I don't
> > >>>>> know the length of the original variables before subsetting.
> > >>>>
> > >>>> You should be able to evaluate the subset expression in the environment
> > >>>> of the formula, i.e.
> > >>>>
> > >>>> eval(gout$call$subset, envir = environment(gout$formula))
> > >>>>
> > >>>> This may give incorrect results if the variables used in subsetting
> > >>>> aren't in the dataframe and have changed since glm() was called.
> > >>>>
> > >>>>
> > >>>>> So now my questions.  Is this idea above (using rownames) OK even though I
> > >>>>> cannot find where (if anywhere) it is documented?  Is there a better way?
> > >>>>> One more guaranteed to be correct in the future?
> > >>>>>
> > >>>>
> > >>>> I would trust evaluating the subset more than grabbing row labels from
> > >>>> gout$x, but I don't know for sure it is likely to be more robust.
> > >>>>
> > >>>> Duncan Murdoch
> > >>
> > >> ______________________________________________
> > >> R-package-devel using r-project.org mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/r-package-devel
> > > 
> > > ______________________________________________
> > > R-package-devel using r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-package-devel
> > >
> > 
> > ______________________________________________
> > R-package-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-package-devel
> 
> ______________________________________________
> R-package-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel