[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
Serguei Sokol
sokol at insa-toulouse.fr
Wed May 31 17:30:44 CEST 2017
On 31/05/2017 at 16:39, Joris Meys wrote:
> Seriously, if a method gives a wrong result, it's wrong.
I did not understand why you and others were using the term "wrong"
for something that I considered to be just a "different" implementation.
A more thorough reading revealed that I had overlooked this phrase in
line()'s documentation: "left and right /thirds/ of the data" (emphasis mine).
Should I be exiled to the Excel department for this sin? That's tough ;)
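
To make the "thirds" reading concrete, here is a minimal sketch of a resistant line built from outer groups of size floor(n/3), as suggested in the quoted message below. It is an illustration under stated assumptions, not what stats::line() does: the data are hypothetical, ties in x are not handled, and taking the intercept as the median of y - slope * x is just one common choice.

```r
# Hypothetical data: 20 points, so n %% 6 == 2 (one of the cases in the subject line)
x <- 1:20
y <- 2 * x + 1

n <- length(x)
k <- floor(n / 3)                 # size of each outer group

ord   <- order(x)
left  <- ord[seq_len(k)]          # indices of the k smallest x values
right <- ord[seq(n - k + 1, n)]   # indices of the k largest x values

# Medians of x and y within each outer third
xL <- median(x[left]);  yL <- median(y[left])
xR <- median(x[right]); yR <- median(y[right])

# Slope through the two median points; intercept from the median residual
# (an assumption made for this sketch, other conventions exist)
slope     <- (yR - yL) / (xR - xL)
intercept <- median(y - slope * x)
```

With n = 20 the two outer groups hold 6 points each and the middle group 8, which is the "more or less equally spaced" split the quoted message argues for.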
Serguei.
> line() does NOT implement the algorithm of Tukey, not even after the patch. We're not discussing Excel here, are we?
>
> The method of Tukey is rather clear, and it is NOT using the default quantile definition from the quantile function. Actually, it doesn't even use quantiles
> to define the groups. It just says that the groups should be more or less equally spaced. As the method of Tukey relies on the medians of the subgroups, it
> would make sense to pick a method that is approximately unbiased with regard to the median. That would be type 8 imho.
>
> To get the size of the outer groups, Tukey would have been more than happy with:
>
> > floor(length(dfr$time) / 3)
> [1] 6
>
> There you have the size of your left and right groups, and now we can discuss which median type should be used for the robust fitting.
>
> But I can honestly not understand why anyone in his right mind would defend a method that is clearly wrong while not working at Microsoft's spreadsheet
> department.
>
> Cheers
> Joris
>
> On Wed, May 31, 2017 at 4:03 PM, Serguei Sokol <sokol at insa-toulouse.fr> wrote:
>
> On 31/05/2017 at 15:40, Joris Meys wrote:
>
> OTOH,
>
> > sapply(1:9, function(i){
> + sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
> + })
> [1] 8 8 6 6 6 6 8 6 6
>
> Only the default (type = 7) and the first two types give the result line() gives now. I think there are plenty of reasons why any of the other
> 6 types might be better suited to Tukey's method.
>
> So to my mind, changing the definition of line() to give sensible output that is in accordance with the theory does not imply any inconsistency with
> the quantile definition in R. At least not with 6 out of the 9 different ones ;-)
>
> Nice shot.
> But OTOE (on the other end ;)
> > sapply(1:9, function(i){
> + sum(dfr$time >= quantile(dfr$time, 2./3., type = i))
> + })
> [1] 8 8 8 8 6 6 8 6 6
>
> Here "8" gains 5 votes against 4 for "6". Two defector methods changed
> their point count between the two tests and should be discarded. That leaves
> us with a score of 3:4, still in favor of "6", but the default method should
> prevail in my opinion.
>
> Serguei.
>
>
>
>
> --
> Joris Meys
> Statistical consultant
>
> Ghent University
> Faculty of Bioscience Engineering
> Department of Mathematical Modelling, Statistics and Bio-Informatics
>
> tel : +32 (0)9 264 61 79
> Joris.Meys at Ugent.be
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
--
Serguei Sokol
Ingenieur de recherche INRA
Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys)
LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504
135 Avenue de Rangueil
31077 Toulouse Cedex 04
tel: +33 5 6155 9276
fax: +33 5 6704 8825
email: sokol at insa-toulouse.fr
http://metasys.insa-toulouse.fr
http://www.lisbp.fr