[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Serguei Sokol sokol at insa-toulouse.fr
Wed May 31 17:30:44 CEST 2017


On 31/05/2017 at 16:39, Joris Meys wrote:
> Seriously, if a method gives a wrong result, it's wrong.
I did not understand why you and others were using the term "wrong"
for something that I considered to be just a "different" implementation.
A more thorough reading revealed that I had overlooked this phrase in
line's doc: "left and right /thirds/ of the data" (emphasis is mine).
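
To make the "thirds" reading concrete, here is a minimal sketch of a
thirds-based split followed by a slope computed from the medians of the two
outer groups. The vectors x and y are made up for illustration (they are not
the dfr data discussed in this thread), and the code is only my reading of
the idea, not what stats::line() actually does internally.

```r
# Hypothetical data: 20 points lying exactly on y = 2 * x + 1,
# so n mod 6 == 2, the case named in the subject line.
x <- 1:20
y <- 2 * x + 1

n <- length(x)
k <- floor(n / 3)               # size of each outer group

ord   <- order(x)               # indices of the points sorted by x
left  <- ord[seq_len(k)]        # the k points with the smallest x
right <- ord[(n - k + 1):n]     # the k points with the largest x

# Slope through the (median x, median y) summary points of the outer groups
slope <- (median(y[right]) - median(y[left])) /
         (median(x[right]) - median(x[left]))

print(k)
print(slope)
```

With a split like this, both outer groups always have the same size, whatever
quantile type one prefers for other purposes.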

Should I be exiled to the Excel department for this sin? That's tough ;)
Serguei.

> line() does NOT implement the algorithm of Tukey, not even after the patch. We're not discussing Excel here, are we?
>
> The method of Tukey is rather clear, and it is NOT using the default quantile definition from the quantile function. Actually, it doesn't even use quantiles 
> to define the groups. It just says that the groups should be more or less equally spaced. As the method of Tukey relies on the medians of the subgroups, it 
> would make sense to pick a method that is approximately unbiased with regard to the median. That would be type 8 imho.
>
> To get the size of the outer groups, Tukey would've been more than happy with a simple:
>
> > floor(length(dfr$time) / 3)
> [1] 6
>
> There you have the size of your left and right group, and now we can discuss about which median type should be used for the robust fitting.
>
> But I can honestly not understand why anyone in his right mind would defend a method that is clearly wrong while not working at Microsoft's spreadsheet 
> department.
>
> Cheers
> Joris
>
> On Wed, May 31, 2017 at 4:03 PM, Serguei Sokol <sokol at insa-toulouse.fr> wrote:
>
>     On 31/05/2017 at 15:40, Joris Meys wrote:
>
>         OTOH,
>
>         > sapply(1:9, function(i){
>         +   sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
>         + })
>         [1] 8 8 6 6 6 6 8 6 6
>
>         Only the default (type = 7) and the first two types give the result line() gives now. I think there are plenty of reasons why any of the other
>         6 types might be better suited to Tukey's method.
>
>         So to my mind, changing the definition of line() to give sensible output that is in accordance with the theory does not imply any inconsistency with
>         the quantile definition in R. At least not with 6 out of the 9 different ones ;-)
>
>     Nice shot.
>     But OTOE (on the other end ;)
>     > sapply(1:9, function(i){
>     +   sum(dfr$time >= quantile(dfr$time, 2./3., type = i))
>     + })
>     [1] 8 8 8 8 6 6 8 6 6
>
>     Here "8" gains 5 votes against 4 for "6". Two "defector" methods (types 3 and 4)
>     changed their count between the two tests and should be discarded. That leaves us
>     with a score of 3:4, still in favor of "6", but the default method should prevail
>     in my opinion.
>
>     Serguei.
>
>
>
>
> -- 
> Joris Meys
> Statistical consultant
>
> Ghent University
> Faculty of Bioscience Engineering
> Department of Mathematical Modelling, Statistics and Bio-Informatics
>
> tel :  +32 (0)9 264 61 79
> Joris.Meys at Ugent.be


-- 
Serguei Sokol
Ingenieur de recherche INRA
Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys)

LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504
135 Avenue de Rangueil
31077 Toulouse Cedex 04

tel: +33 5 6155 9276
fax: +33 5 6704 8825
email: sokol at insa-toulouse.fr
http://metasys.insa-toulouse.fr
http://www.lisbp.fr


