[R-pkg-devel] Determine subset from glm object

Charles Geyer ch@rlie @ending from @t@t@umn@edu
Sun Jul 8 21:25:14 CEST 2018

I spoke too soon.  The problem isn't that I don't know how to get the
subset argument. I am just calling glm (via eval) with (mostly) the
same arguments as the call to my function, so subset is (if not
missing) an argument to my function too.  So I can just use it.

The problem is that I then want to call glm again fitting a subset of
the original subset (if there was one).  And when I do that glm will
refer to the original data wherever it is, and I don't have that.

if this isn't clear, here is the code as it stands now

The issue is with the lines (very near the end)

subset.lcm <- as.integer(rownames(modmat))
subset.lcm <- subset.lcm[linearity]
# call glm again
call.glm$subset <- subset.lcm
gout.lcm <- eval(call.glm, parent.frame())

I can see from what Duncan said that I really don't want the
as.integer around rownames.  But it is not clear what would be better.

I just had another thought that I could get the original data with
another call to glm with subset removed from the call and method =
"model.frame" added.  And I think (maybe, have to try it) that it
would have NA's removed or whatever na.action says to do.
But that seems redundant.

On Sun, Jul 8, 2018, 1:04 PM Charles Geyer <charlie using stat.umn.edu> wrote:
> I think your second option sounds better because this is all happening inside one function I'm writing so users won't be able mess with the glm object. Many thanks.
> On Sun, Jul 8, 2018, 12:10 PM Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>> On 08/07/2018 11:48 AM, Charles Geyer wrote:
>> > I need to find out from an object returned by R function glm with argument
>> > x = TRUE
>> > what the subsetting was.  It appears that if gout is that object, then
>> >
>> > as.integer(rownames(gout$x))
>> >
>> > is a subset vector equivalent to the one actually used.
>> You don't want the "as.integer".  If the dataframe had rownames to start
>> with, the x component of the fit will have row labels consisting of
>> those labels, so as.integer may fail.  Even if it doesn't, the rownames
>> aren't necessarily sequential integers.   You can index the dataframe by
>> the character versions of the default numbers, so simply
>> rownames(gout$x) should always work.
>> More generally, I'm not sure your question is well posed.  What do you
>> mean by "the subsetting"?  If you have something like
>> df <- data.frame(letters, x = 1:26, y = rbinom(26, 1, 0.5))
>> df1 <- subset(df, letters > "b" & letters < "y")
>> gout <- glm(y ~ x, data = df1, subset = letters < "q", x = TRUE)
>> the rownames(gout$x) are going to be numbers for rows of df, because df1
>> will get a subset of those as row labels.
>> > I do also have the call to glm (as a call object) so can determine the
>> > actual subset argument, but this seems to be not so useful because I don't
>> > know the length of the original variables before subsetting.
>> You should be able to evaluate the subset expression in the environment
>> of the formula, i.e.
>> eval(gout$call$subset, envir = environment(gout$formula))
>> This may give incorrect results if the variables used in subsetting
>> aren't in the dataframe and have changed since glm() was called.
>> > So now my questions.  Is this idea above (using rownames) OK even though I
>> > cannot find where (if anywhere) it is documented?  Is there a better way?
>> > One more guaranteed to be correct in the future?
>> >
>> I would trust evaluating the subset more than grabbing row labels from
>> gout$x, but I don't know for sure it is likely to be more robust.
>> Duncan Murdoch

More information about the R-package-devel mailing list