[R] Fwd: Potential Issue with lm.influence

Wed Apr 3 03:08:59 CEST 2019

Hey John,

I am aware they are high leverage points, and that the model is not the
best for them. The purpose of this dataset was to explore high leverage
points, and diagnostic statistics through which one would identify them.

What I am saying is that the current behavior of the function seems a
little non-specific to me; the influence for this problem is
finite/computable manually by fitting n models to n-1 points (manually
holding out each point individually to obtain the loo-variance, and
computing the influence in the non-approximate way).

I am just suggesting that it seems the function could be improved by, say,
throwing specific warnings when NaNs may arise. Ie, "Your have points that
are very high leverage. The approximation technique is not numerically
stable for these points and the results should be used with caution"
etc...; I am sure there are other also pre-hoc approaches to diagnose other
ways in which this function could fail). The approximation technique not
behaving well for points that are ultra high leverage just seems peculiar
that that would return an NaN with no other recommendations/advice/specific
warnings, especially since the influence is frequently used to diagnosing
this specific issue.

Alternatively, one could afford an optional argument type="manual" that
computes the held-out variance manually rather than the approximate
fashion, and add a comment to use this in the help menu when you have high
leverage points (this is what I ended up doing to obtain the true influence
and the externally studentized residual).

 I just think some more specificity could be of use for future users, to
make the R:stats community even better :) Does that make sense?

Sincerely,
Eric

On Tue, Apr 2, 2019 at 7:53 PM Fox, John <jfox using mcmaster.ca> wrote:

> Dear Eric,
>
> Have you looked at your data? -- for example:
>
>         plot(log(Moons) ~ Volume, data = moon_data)
>         text(log(Moons) ~ Volume, data = moon_data, labels=Name, adj=1,
> subset = Volume > 400)
>
> The negative-binomial model doesn't look reasonable, does it?
>
> After you eliminate Jupiter there's one very high leverage point left,
> Saturn. Computing studentized residuals entails an approximation to
> deleting that as well from the model, so try fitting
>
>         fit3 <- update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
>         summary(fit3)
>
> which runs into numeric difficulties.
>
> Then look at:
>
>         plot(log(Moons) ~ Volume, data = moon_data, subset = Volume < 400)
>
> Finally, try
>
>         plot(log(Moons) ~ log(Volume), data = moon_data)
>         fit4 <- update(fit2, . ~ log(Volume))
>         rstudent(fit4)
>
> I hope this helps,
>  John
>
> -----------------------------------------------------------------
> John Fox
> Professor Emeritus
> McMaster University
> Hamilton, Ontario, Canada
> Web: https://socialsciences.mcmaster.ca/jfox/
>
>
>
>
> > -----Original Message-----
> > From: R-help [mailto:r-help-bounces using r-project.org] On Behalf Of Eric
> > Bridgeford
> > Sent: Tuesday, April 2, 2019 5:01 PM
> > To: Bert Gunter <bgunter.4567 using gmail.com>
> > Cc: R-help <r-help using r-project.org>
> > Subject: Re: [R] Fwd: Potential Issue with lm.influence
> >
> > I agree the influence documentation suggests NaNs may result; however, as
> > these can be manually computed and are, indeed, finite/existing (ie,
> > computing the held-out influence by manually training n models for n
> points
> > to obtain n leave one out influence measures), I don't possibly see how
> the
> > function SHOULD return NaN, and given that it is returning NaN, that
> > suggests to me that there should be either a) Providing an alternative
> > method to compute them that (may be slower) that returns the correct
> > results in the even that lm.influence does not return a good
> approximation
> > (ie, a command line argument for type="approx" that does the
> > approximation strategy employed currently, or an alternative
> type="direct"
> > or something like that that computes them manually), or b) a heuristic to
> > suggest why NaNs might result from one's particular inputs/what can be
> > done to fix it (if the approximation strategy is the source of the
> problem) or
> > what the issue is with the data that will cause NaNs. Hence I was
> looking to
> > start a discussion around the specific strategy employed to compute the
> > elements.
> >
> > Below is the code:
> > moon_data <- structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L, 5L,
> 11L,
> >                                                12L, 9L, 10L, 4L, 6L,
> 3L), .Label = c("Ceres ", "Earth",
> > "Eris ",
> >
> >          "Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ",
> "Neptune ",
> >
> >          "Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
> >                             Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2,
> 9.54, 19.22,
> >                                          30.06, 39.5, 43.35, 45.8,
> 67.7), Diameter = c(0.382, 0.949,
> >
> >            1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
> >
> >            0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e-04, 317.8,
> >
> >                                  95.2, 14.6, 17.2, 0.0022, 7e-04, 7e-04,
> 0.0025), Moons = c(0L,
> >
> >
> >                 0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L),
> Volume =
> > c(0.0291869497930152,
> >
> >
> >
> >     0.447504348276571, 0.523598775598299, 0.0788376225681443,
> >
> >
> >
> >     0.000268082573106329, 737.393372232996, 441.729261571372,
> >
> >
> >
> >     33.6865588825666, 30.6549628355953, 0.00305362805928928,
> >
> >
> >
> >     0.00176714586764426, 0.00090477868423386, 0.00359136400182873
> >
> >
> >                 )), row.names = c(NA, -13L), class = "data.frame")
> >
> > fit <- glm.nb(Moons ~ Volume, data = moon_data)
> > rstudent(fit)
> >
> > fit2 <- update(fit, subset = Name != "Jupiter ")
> > rstudent(fit2)
> >
> > influence(fit2)$sigma
> >
> > #        1        2        3        4        5        7        8        9
> >      10       11       12       13
> > # 1.077945 1.077813 1.165025 1.181685 1.077954      NaN 1.044454 1.152110
> > 1.187586 1.181696 1.077954 1.165147
> >
> > Sincerely,
> > Eric
> >
> > On Tue, Apr 2, 2019 at 4:38 PM Bert Gunter <bgunter.4567 using gmail.com>
> > wrote:
> >
> > > Also, I suggest you read ?influence which may explain the source of
> > > your NaN's .
> > >
> > > Bert Gunter
> > >
> > > "The trouble with having an open mind is that people keep coming along
> > > and sticking things into it."
> > > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> > >
> > >
> > > On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter <bgunter.4567 using gmail.com>
> > wrote:
> > >
> > >> I told you already: **Include code inline **
> > >>
> > >> See ?dput for how to include a text version of objects, such as data
> > >> frames, inline.
> > >>
> > >> Otherwise, I believe .txt text files are not stripped if you insist
> > >> on
> > >> *attaching* data or code. Others may have better advice.
> > >>
> > >>
> > >> Bert Gunter
> > >>
> > >> "The trouble with having an open mind is that people keep coming
> > >> along and sticking things into it."
> > >> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> > >>
> > >>
> > >> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford <ericwb95 using gmail.com>
> > >> wrote:
> > >>
> > >>> How can I add attachments? The following two files were attached in
> > >>> the initial message
> > >>>
> > >>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter <bgunter.4567 using gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Nothing was attached. The r-help server strips most attachments.
> > >>>> Include your code inline.
> > >>>>
> > >>>> Also note that
> > >>>>
> > >>>> > 0/0
> > >>>> [1] NaN
> > >>>>
> > >>>> so maybe something like that occurs in the course of your
> calculations.
> > >>>> But that's just a guess, so feel free to disregard.
> > >>>>
> > >>>>
> > >>>> Bert Gunter
> > >>>>
> > >>>> "The trouble with having an open mind is that people keep coming
> > >>>> along and sticking things into it."
> > >>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> > >>>>
> > >>>>
> > >>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford
> > >>>> <ericwb95 using gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi R core team,
> > >>>>>
> > >>>>> I experienced the following issue with the attached data/code
> > >>>>> snippet, where the studentized residual for a single observation
> > >>>>> appears to be NaN given finite predictors/responses, which appears
> > >>>>> to be driven by the glm.influence method in the stats package. I
> > >>>>> am curious to whether this is a consequence of the specific
> > >>>>> implementation used for computing the influence, which it would
> > >>>>> appear is the driving force for the NaN influence for the point,
> > >>>>> that I was ultimately able to trace back through the lm.influence
> > >>>>> method to this specific line <
> > >>>>> https://github.com/SurajGupta/r-
> > source/blob/a28e609e72ed7c47f6ddfb
> > >>>>> b86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67
> > >>>>> >
> > >>>>> which
> > >>>>> calls C code which calls iminfl.f
> > >>>>> <
> > >>>>> https://github.com/SurajGupta/r-source/blob/master/src/library/sta
> > >>>>> ts/src/lminfl.f
> > >>>>> >
> > >>>>> (I
> > >>>>> don't know fortran so I can't debug further). My understanding is
> > >>>>> that the specific issue would have to do with the leave-one-out
> > >>>>> variance estimate associated with this particular point, which it
> > >>>>> seems based on my understanding should be finite given finite
> > >>>>> predictors/responses. Let me know. Thanks!
> > >>>>>
> > >>>>> Sincerely,
> > >>>>>
> > >>>>> --
> > >>>>> Eric Bridgeford
> > >>>>> ericwb.me
> > >>>>> ______________________________________________
> > >>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> > >>>>> PLEASE do read the posting guide
> > >>>>> http://www.R-project.org/posting-guide.html
> > >>>>> and provide commented, minimal, self-contained, reproducible code.
> > >>>>>
> > >>>>
> > >>>
> > >>> --
> > >>> Eric Bridgeford
> > >>> ericwb.me
> > >>>
> > >>
> >
> > --
> > Eric Bridgeford
> > ericwb.me
> >
> >       [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-
> > guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>

-- 
Eric Bridgeford
ericwb.me

	[[alternative HTML version deleted]]