[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Martin Maechler maechler at stat.math.ethz.ch
Tue May 30 18:51:08 CEST 2017


>>>>> Serguei Sokol <sokol at insa-toulouse.fr>
>>>>>     on Tue, 30 May 2017 16:01:17 +0200 writes:

    > On 30/05/2017 at 09:33, Martin Maechler wrote: ...
    >> However, even after the patch, the example from the SO
    >> post differs from the result of Richie Cotton's
    >> function...
    > The explanation is quite simple. In the SO function, the
    > first third of the example data contains 6 points (of 19
    > in total), while line()'s definition of quantile leads to
    > 8 points. The same counts (6 and 8) occur at the other
    > end of the sample.

so the numbers of observations in the three thirds are
   {8, 3, 8}  in line()  [also after your patch, right?]

whereas in MMline() they are as they should be, namely

   {6, 7, 6}
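
For reference, the group sizes prescribed by the rule in the
literature, as I understand it, can be written down in a few lines
of R. This is only a sketch: third_sizes() is a hypothetical helper
(not part of R and not MMline()), and it ignores the additional
requirement that tied x values should not be split across groups.

```r
## Sizes of the left, middle and right thirds for n points:
##   n = 3k     ->  k,     k,     k
##   n = 3k + 1 ->  k,     k + 1, k
##   n = 3k + 2 ->  k + 1, k,     k + 1
## third_sizes() is a hypothetical name used only for illustration.
third_sizes <- function(n) {
  n <- as.integer(n)
  k <- n %/% 3L          # base size of each group
  r <- n %% 3L           # leftover points to distribute
  switch(r + 1L,
         c(k, k, k),
         c(k, k + 1L, k),
         c(k + 1L, k, k + 1L))
}

third_sizes(19)   # the n = 19 example from this thread
```

For n = 19 = 3*6 + 1 the extra point goes to the middle group, which
gives the {6, 7, 6} split above.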

But the  {8, 3, 8}  split is not at all what "the literature",
including Tukey himself, says "should" be done.
(Other literature on the topic suggests that the optimal sizes
 of the three groups depend on the distribution of x ..)

OTOH, MMline() does exactly what "the literature" and also the
references on the  ?line  help page say.
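
To make the basic idea concrete, here is a minimal one-pass sketch of
a median-of-thirds line. It is NOT the MMline() function discussed in
this thread and not what line() computes; resistant_line_sketch() is
a hypothetical name, the intercept rule (median of y - slope * x) is
just one of several choices found in the literature, and ties in x
across group boundaries are not handled.

```r
## Hypothetical one-pass resistant line (illustration only).
## Outer thirds have size k, or k + 1 when n = 3k + 2.
resistant_line_sketch <- function(x, y) {
  n  <- length(x)
  k  <- n %/% 3L
  nL <- if (n %% 3L == 2L) k + 1L else k   # size of each outer third
  o  <- order(x)
  iL <- o[seq_len(nL)]                     # indices of the left third
  iR <- o[seq.int(n - nL + 1L, n)]         # indices of the right third
  ## slope through the (median x, median y) points of the outer thirds
  slope <- (median(y[iR]) - median(y[iL])) /
           (median(x[iR]) - median(x[iL]))
  ## one simple intercept choice; other definitions exist
  intercept <- median(y - slope * x)
  c(intercept = intercept, slope = slope)
}

x <- 1:19
resistant_line_sketch(x, 2 * x + 1)   # exactly linear data
```

On exactly linear data such as y = 2x + 1 the sketch recovers
intercept 1 and slope 2; the iterative refinements recommended in
the literature are deliberately left out.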

    > The x sample contains a few repeated values; this is
    > certainly the reason for the different quantiles.
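
How repeated values can inflate an outer group is easy to see on
made-up data. The sketch below compares counting by rank with a
hypothetical cut-point rule ("everything not exceeding the k-th order
statistic"); the data are invented, not the x from the SO post, and
the cut-point rule is an assumption for illustration, not a claim
about what line()'s C code does.

```r
## Made-up data with ties; n = 9, so each third should hold 3 points.
x <- c(1, 2, 2, 2, 2, 3, 4, 5, 6)
n <- length(x)
k <- n %/% 3L

## Rule A: the left third is simply the k smallest observations.
size_by_rank <- k

## Rule B (hypothetical): the left third is everything not exceeding
## a cut point, here the k-th order statistic.
cut_point   <- sort(x)[k]
size_by_cut <- sum(x <= cut_point)

c(by_rank = size_by_rank, by_cut = size_by_cut)
```

Here the cut point falls inside the run of 2s, so the cut-point rule
puts 5 points into the left group where the rank rule puts 3.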

    > I am not sure that one quantile definition is better or
    > more correct than the other. 

    > So I would leave line()'s definition as is.

you mean  _after_ applying your patch, I assume.

I currently tend to disagree. If we change line() we should
rather fix more ..
Note the 'Subject' you've chosen for this thread,
 "... does not produce the correct Tukey line",
so I think we should get better.

Apart from Richie / my  MMline() function, I've also noticed
that   ACSWR :: resistant_line()
exists.

However, "the literature" (see the references below), notably the
two with Hoaglin, strongly recommends smarter iterations, and
-- lo and behold! -- when this topic last came up (for me) in
Dec. 2014, I spent about 2 days of work (or more?) getting the
FORTRAN code from the 1981 book (abbreviated as the "ABC of
EDA") from a somewhat useful OCR scan into compilable Fortran
code. I then f2c'ed it, wrote an R interface function, and found
problems, i.e., bugs, including infinite loops. I fixed most of
them AFAICS, but somehow did not finish making the result
available.

Yes, and I have too many other things on my desk... this will
have to wait!

References:

     Tukey, J. W. (1977).  _Exploratory Data Analysis_, Reading
     Massachusetts: Addison-Wesley.

     Velleman, P. F. and Hoaglin, D. C. (1981) _Applications, Basics
     and Computing of Exploratory Data Analysis_ Duxbury Press.

     Emerson, J. D. and Hoaglin, D. C. (1983) Resistant Lines for y
     versus x.  Chapter 5 of _Understanding Robust and Exploratory Data
     Analysis_, eds. David C. Hoaglin, Frederick Mosteller and John W.
     Tukey.  Wiley.

     Johnstone, I. M. and Velleman, P. F. (1985) The Resistant Line
     and Related Regression Methods.  _Journal of the American
     Statistical Association_ *80*, 1041-1054.  <URL:
     https://dx.doi.org/10.1080/01621459.1985.10478222>


    > Best, Sergueï.

Martin Maechler, ETH Zurich (and R core team)


