[R] Potential Issue with lm.influence

Wed Apr 3 18:57:29 CEST 2019

Second!

Bert Gunter

On Wed, Apr 3, 2019 at 9:35 AM Richard M. Heiberger <rmh using temple.edu> wrote:

> fortune nomination.
>
>
> The lesson to me here is that if you fit a sufficiently unreasonable
> model to data, the computations may break down.
>
> On Wed, Apr 3, 2019 at 10:18 AM Fox, John <jfox using mcmaster.ca> wrote:
> >
> > Dear Eric,
> >
> > I'm afraid that your argument doesn't make sense to me. As you saw when
> you tried
> >
> >         fit3 <- update(fit, subset = !(Name %in% c("Jupiter ", "Saturn
> ")))
> >
> > glm.nb() effectively wasn't able to estimate the theta parameter of the
> negative binomial model. So why would it be better to base deletion
> diagnostics on actually refitting the model?
> >
> > The lesson to me here is that if you fit a sufficiently unreasonable
> model to data, the computations may break down. Other than drawing
> attention to the NaN with an explicit warning, I don't see what more could
> usefully be done.
> >
> > Best,
> >  John
> >
> > > On Apr 2, 2019, at 9:08 PM, Eric Bridgeford <ericwb95 using gmail.com>
> wrote:
> > >
> > > Hey John,
> > >
> > > I am aware they are high leverage points, and that the model is not the
> > > best for them. The purpose of this dataset was to explore high leverage
> > > points, and diagnostic statistics through which one would identify
> them.
> > >
> > > What I am saying is that the current behavior of the function seems a
> > > little non-specific to me; the influence for this problem is
> > > finite/computable manually by fitting n models to n-1 points (manually
> > > holding out each point individually to obtain the loo-variance, and
> > > computing the influence in the non-approximate way).
> > >
> > > I am just suggesting that it seems the function could be improved by,
> say,
> > > throwing specific warnings when NaNs may arise. Ie, "Your have points
> that
> > > are very high leverage. The approximation technique is not numerically
> > > stable for these points and the results should be used with caution"
> > > etc...; I am sure there are other also pre-hoc approaches to diagnose
> other
> > > ways in which this function could fail). The approximation technique
> not
> > > behaving well for points that are ultra high leverage just seems
> peculiar
> > > that that would return an NaN with no other
> recommendations/advice/specific
> > > warnings, especially since the influence is frequently used to
> diagnosing
> > > this specific issue.
> > >
> > > Alternatively, one could afford an optional argument type="manual" that
> > > computes the held-out variance manually rather than the approximate
> > > fashion, and add a comment to use this in the help menu when you have
> high
> > > leverage points (this is what I ended up doing to obtain the true
> influence
> > > and the externally studentized residual).
> > >
> > > I just think some more specificity could be of use for future users, to
> > > make the R:stats community even better :) Does that make sense?
> > >
> > > Sincerely,
> > > Eric
> > >
> > > On Tue, Apr 2, 2019 at 7:53 PM Fox, John <jfox using mcmaster.ca> wrote:
> > >
> > >> Dear Eric,
> > >>
> > >> Have you looked at your data? -- for example:
> > >>
> > >>        plot(log(Moons) ~ Volume, data = moon_data)
> > >>        text(log(Moons) ~ Volume, data = moon_data, labels=Name, adj=1,
> > >> subset = Volume > 400)
> > >>
> > >> The negative-binomial model doesn't look reasonable, does it?
> > >>
> > >> After you eliminate Jupiter there's one very high leverage point left,
> > >> Saturn. Computing studentized residuals entails an approximation to
> > >> deleting that as well from the model, so try fitting
> > >>
> > >>        fit3 <- update(fit, subset = !(Name %in% c("Jupiter ", "Saturn
> ")))
> > >>        summary(fit3)
> > >>
> > >> which runs into numeric difficulties.
> > >>
> > >> Then look at:
> > >>
> > >>        plot(log(Moons) ~ Volume, data = moon_data, subset = Volume <
> 400)
> > >>
> > >> Finally, try
> > >>
> > >>        plot(log(Moons) ~ log(Volume), data = moon_data)
> > >>        fit4 <- update(fit2, . ~ log(Volume))
> > >>        rstudent(fit4)
> > >>
> > >> I hope this helps,
> > >> John
> > >>
> > >> -----------------------------------------------------------------
> > >> John Fox
> > >> Professor Emeritus
> > >> McMaster University
> > >> Hamilton, Ontario, Canada
> > >> Web: https://socialsciences.mcmaster.ca/jfox/
> > >>
> > >>
> > >>
> > >>
> > >>> -----Original Message-----
> > >>> From: R-help [mailto:r-help-bounces using r-project.org] On Behalf Of Eric
> > >>> Bridgeford
> > >>> Sent: Tuesday, April 2, 2019 5:01 PM
> > >>> To: Bert Gunter <bgunter.4567 using gmail.com>
> > >>> Cc: R-help <r-help using r-project.org>
> > >>> Subject: Re: [R] Fwd: Potential Issue with lm.influence
> > >>>
> > >>> I agree the influence documentation suggests NaNs may result;
> however, as
> > >>> these can be manually computed and are, indeed, finite/existing (ie,
> > >>> computing the held-out influence by manually training n models for n
> > >> points
> > >>> to obtain n leave one out influence measures), I don't possibly see
> how
> > >> the
> > >>> function SHOULD return NaN, and given that it is returning NaN, that
> > >>> suggests to me that there should be either a) Providing an
> alternative
> > >>> method to compute them that (may be slower) that returns the correct
> > >>> results in the even that lm.influence does not return a good
> > >> approximation
> > >>> (ie, a command line argument for type="approx" that does the
> > >>> approximation strategy employed currently, or an alternative
> > >> type="direct"
> > >>> or something like that that computes them manually), or b) a
> heuristic to
> > >>> suggest why NaNs might result from one's particular inputs/what can
> be
> > >>> done to fix it (if the approximation strategy is the source of the
> > >> problem) or
> > >>> what the issue is with the data that will cause NaNs. Hence I was
> > >> looking to
> > >>> start a discussion around the specific strategy employed to compute
> the
> > >>> elements.
> > >>>
> > >>> Below is the code:
> > >>> moon_data <- structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L,
> 5L,
> > >> 11L,
> > >>>                                               12L, 9L, 10L, 4L, 6L,
> > >> 3L), .Label = c("Ceres ", "Earth",
> > >>> "Eris ",
> > >>>
> > >>>         "Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ",
> > >> "Neptune ",
> > >>>
> > >>>         "Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
> > >>>                            Distance = c(0.39, 0.72, 1, 1.52, 2.75,
> 5.2,
> > >> 9.54, 19.22,
> > >>>                                         30.06, 39.5, 43.35, 45.8,
> > >> 67.7), Diameter = c(0.382, 0.949,
> > >>>
> > >>>           1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
> > >>>
> > >>>           0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e-04, 317.8,
> > >>>
> > >>>                                 95.2, 14.6, 17.2, 0.0022, 7e-04,
> 7e-04,
> > >> 0.0025), Moons = c(0L,
> > >>>
> > >>>
> > >>>                0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L),
> > >> Volume =
> > >>> c(0.0291869497930152,
> > >>>
> > >>>
> > >>>
> > >>>    0.447504348276571, 0.523598775598299, 0.0788376225681443,
> > >>>
> > >>>
> > >>>
> > >>>    0.000268082573106329, 737.393372232996, 441.729261571372,
> > >>>
> > >>>
> > >>>
> > >>>    33.6865588825666, 30.6549628355953, 0.00305362805928928,
> > >>>
> > >>>
> > >>>
> > >>>    0.00176714586764426, 0.00090477868423386, 0.00359136400182873
> > >>>
> > >>>
> > >>>                )), row.names = c(NA, -13L), class = "data.frame")
> > >>>
> > >>> fit <- glm.nb(Moons ~ Volume, data = moon_data)
> > >>> rstudent(fit)
> > >>>
> > >>> fit2 <- update(fit, subset = Name != "Jupiter ")
> > >>> rstudent(fit2)
> > >>>
> > >>> influence(fit2)$sigma
> > >>>
> > >>> #        1        2        3        4        5        7        8
>     9
> > >>>     10       11       12       13
> > >>> # 1.077945 1.077813 1.165025 1.181685 1.077954      NaN 1.044454
> 1.152110
> > >>> 1.187586 1.181696 1.077954 1.165147
> > >>>
> > >>> Sincerely,
> > >>> Eric
> > >>>
> > >>> On Tue, Apr 2, 2019 at 4:38 PM Bert Gunter <bgunter.4567 using gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Also, I suggest you read ?influence which may explain the source of
> > >>>> your NaN's .
> > >>>>
> > >>>> Bert Gunter
> > >>>>
> > >>>> "The trouble with having an open mind is that people keep coming
> along
> > >>>> and sticking things into it."
> > >>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> > >>>>
> > >>>>
> > >>>> On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter <bgunter.4567 using gmail.com>
> > >>> wrote:
> > >>>>
> > >>>>> I told you already: **Include code inline **
> > >>>>>
> > >>>>> See ?dput for how to include a text version of objects, such as
> data
> > >>>>> frames, inline.
> > >>>>>
> > >>>>> Otherwise, I believe .txt text files are not stripped if you insist
> > >>>>> on
> > >>>>> *attaching* data or code. Others may have better advice.
> > >>>>>
> > >>>>>
> > >>>>> Bert Gunter
> > >>>>>
> > >>>>> "The trouble with having an open mind is that people keep coming
> > >>>>> along and sticking things into it."
> > >>>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> > >>>>>
> > >>>>>
> > >>>>> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford <ericwb95 using gmail.com
> >
> > >>>>> wrote:
> > >>>>>
> > >>>>>> How can I add attachments? The following two files were attached
> in
> > >>>>>> the initial message
> > >>>>>>
> > >>>>>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter <
> bgunter.4567 using gmail.com>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Nothing was attached. The r-help server strips most attachments.
> > >>>>>>> Include your code inline.
> > >>>>>>>
> > >>>>>>> Also note that
> > >>>>>>>
> > >>>>>>>> 0/0
> > >>>>>>> [1] NaN
> > >>>>>>>
> > >>>>>>> so maybe something like that occurs in the course of your
> > >> calculations.
> > >>>>>>> But that's just a guess, so feel free to disregard.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Bert Gunter
> > >>>>>>>
> > >>>>>>> "The trouble with having an open mind is that people keep coming
> > >>>>>>> along and sticking things into it."
> > >>>>>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip
> )
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford
> > >>>>>>> <ericwb95 using gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hi R core team,
> > >>>>>>>>
> > >>>>>>>> I experienced the following issue with the attached data/code
> > >>>>>>>> snippet, where the studentized residual for a single observation
> > >>>>>>>> appears to be NaN given finite predictors/responses, which
> appears
> > >>>>>>>> to be driven by the glm.influence method in the stats package. I
> > >>>>>>>> am curious to whether this is a consequence of the specific
> > >>>>>>>> implementation used for computing the influence, which it would
> > >>>>>>>> appear is the driving force for the NaN influence for the point,
> > >>>>>>>> that I was ultimately able to trace back through the
> lm.influence
> > >>>>>>>> method to this specific line <
> > >>>>>>>> https://github.com/SurajGupta/r-
> > >>> source/blob/a28e609e72ed7c47f6ddfb
> > >>>>>>>> b86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67
> > >>>>>>>>>
> > >>>>>>>> which
> > >>>>>>>> calls C code which calls iminfl.f
> > >>>>>>>> <
> > >>>>>>>>
> https://github.com/SurajGupta/r-source/blob/master/src/library/sta
> > >>>>>>>> ts/src/lminfl.f
> > >>>>>>>>>
> > >>>>>>>> (I
> > >>>>>>>> don't know fortran so I can't debug further). My understanding
> is
> > >>>>>>>> that the specific issue would have to do with the leave-one-out
> > >>>>>>>> variance estimate associated with this particular point, which
> it
> > >>>>>>>> seems based on my understanding should be finite given finite
> > >>>>>>>> predictors/responses. Let me know. Thanks!
> > >>>>>>>>
> > >>>>>>>> Sincerely,
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> Eric Bridgeford
> > >>>>>>>> ericwb.me
> > >>>>>>>> ______________________________________________
> > >>>>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more,
> see
> > >>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> > >>>>>>>> PLEASE do read the posting guide
> > >>>>>>>> http://www.R-project.org/posting-guide.html
> > >>>>>>>> and provide commented, minimal, self-contained, reproducible
> code.
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> Eric Bridgeford
> > >>>>>> ericwb.me
> > >>>>>>
> > >>>>>
> > >>>
> > >>> --
> > >>> Eric Bridgeford
> > >>> ericwb.me
> > >>>
> > >>>      [[alternative HTML version deleted]]
> > >>>
> > >>> ______________________________________________
> > >>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > >>> https://stat.ethz.ch/mailman/listinfo/r-help
> > >>> PLEASE do read the posting guide http://www.R-project.org/posting-
> > >>> guide.html
> > >>> and provide commented, minimal, self-contained, reproducible code.
> > >>
> > >
> > >
> > > --
> > > Eric Bridgeford
> > > ericwb.me
> > >
> > >       [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

	[[alternative HTML version deleted]]