[R-pkg-devel] Determine subset from glm object

Ben Bolker bbolker @ending from gm@il@com
Mon Jul 9 18:06:23 CEST 2018


  From painful experience: model.frame() does *NOT* necessarily return a
data frame that can be successfully used as the data= argument for models.

  - transformed variables (e.g. log(x)) will be in the model frame
rather than the original variables,  so when model.frame() is called
again within glm(), it won't find the original variables
  - variables with data-dependent bases (poly(), ns(), etc.) get
computed and stuck in the model frame - again, the original variables
are inaccessible


On 2018-07-09 11:20 AM, Heather Turner wrote:
> 
> 
> On Sun, Jul 8, 2018, at 8:25 PM, Charles Geyer wrote:
>> I spoke too soon.  The problem isn't that I don't know how to get the
>> subset argument. I am just calling glm (via eval) with (mostly) the
>> same arguments as the call to my function, so subset is (if not
>> missing) an argument to my function too.  So I can just use it.
>>
>> The problem is that I then want to call glm again fitting a subset of
>> the original subset (if there was one).  And when I do that glm will
>> refer to the original data wherever it is, and I don't have that.
>>
>> if this isn't clear, here is the code as it stands now
>> https://github.com/cjgeyer/glmdr/blob/master/package/glmdr/R/glmdr.R.
>>
>> The issue is with the lines (very near the end)
>>
>> subset.lcm <- as.integer(rownames(modmat))
>> subset.lcm <- subset.lcm[linearity]
>> # call glm again
>> call.glm$subset <- subset.lcm
>> gout.lcm <- eval(call.glm, parent.frame())
>>
>> I can see from what Duncan said that I really don't want the
>> as.integer around rownames.  But it is not clear what would be better.
>>
>> I just had another thought that I could get the original data with
>> another call to glm with subset removed from the call and method =
>> "model.frame" added.  And I think (maybe, have to try it) that it
>> would have NA's removed or whatever na.action says to do.
>> But that seems redundant.
>>
>>
> As you are calling stats::glm, you can use `model.frame` to get the data used to fit the model after applying subset and na.action. So then you can do:
> 
> call.glm$subset <- linearity
> call.glm$data <- model.frame(gout)
> 
> I think this is what you are after?
> 
> Heather
> 
>>
>> On Sun, Jul 8, 2018, 1:04 PM Charles Geyer <charlie using stat.umn.edu> wrote:
>>>
>>> I think your second option sounds better because this is all happening inside one function I'm writing so users won't be able mess with the glm object. Many thanks.
>>>
>>> On Sun, Jul 8, 2018, 12:10 PM Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>>>>
>>>> On 08/07/2018 11:48 AM, Charles Geyer wrote:
>>>>> I need to find out from an object returned by R function glm with argument
>>>>> x = TRUE
>>>>> what the subsetting was.  It appears that if gout is that object, then
>>>>>
>>>>> as.integer(rownames(gout$x))
>>>>>
>>>>> is a subset vector equivalent to the one actually used.
>>>>
>>>> You don't want the "as.integer".  If the dataframe had rownames to start
>>>> with, the x component of the fit will have row labels consisting of
>>>> those labels, so as.integer may fail.  Even if it doesn't, the rownames
>>>> aren't necessarily sequential integers.   You can index the dataframe by
>>>> the character versions of the default numbers, so simply
>>>> rownames(gout$x) should always work.
>>>>
>>>> More generally, I'm not sure your question is well posed.  What do you
>>>> mean by "the subsetting"?  If you have something like
>>>>
>>>> df <- data.frame(letters, x = 1:26, y = rbinom(26, 1, 0.5))
>>>>
>>>> df1 <- subset(df, letters > "b" & letters < "y")
>>>>
>>>> gout <- glm(y ~ x, data = df1, subset = letters < "q", x = TRUE)
>>>>
>>>> the rownames(gout$x) are going to be numbers for rows of df, because df1
>>>> will get a subset of those as row labels.
>>>>
>>>>
>>>>> I do also have the call to glm (as a call object) so can determine the
>>>>> actual subset argument, but this seems to be not so useful because I don't
>>>>> know the length of the original variables before subsetting.
>>>>
>>>> You should be able to evaluate the subset expression in the environment
>>>> of the formula, i.e.
>>>>
>>>> eval(gout$call$subset, envir = environment(gout$formula))
>>>>
>>>> This may give incorrect results if the variables used in subsetting
>>>> aren't in the dataframe and have changed since glm() was called.
>>>>
>>>>
>>>>> So now my questions.  Is this idea above (using rownames) OK even though I
>>>>> cannot find where (if anywhere) it is documented?  Is there a better way?
>>>>> One more guaranteed to be correct in the future?
>>>>>
>>>>
>>>> I would trust evaluating the subset more than grabbing row labels from
>>>> gout$x, but I don't know for sure it is likely to be more robust.
>>>>
>>>> Duncan Murdoch
>>
>> ______________________________________________
>> R-package-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-package-devel
> 
> ______________________________________________
> R-package-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel
>



More information about the R-package-devel mailing list