[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Serguei Sokol sokol at insa-toulouse.fr
Mon May 29 15:13:46 CEST 2017


Here is an attached patch.

Best,
Serguei.

Le 29/05/2017 à 12:21, Serguei Sokol a écrit :
> The problem or actual R implementation relies on an assumption
> that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6)
> which reveals not to be true despite very trustful appearance.
>
> If we continue with the example of x=y=1:9
> then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not R's one)
> while median(y[i] | x[i] <= quantile(x, 1/3))=2
> On the other sample's side we've got 7.5 and 8 for x and y respectively.
> Hence the slope (8-2)/(7.5-2.5)=1.2
>
> To get a correct version of this, one should calculate x robust points in the same way as the y's,
> i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >= quantile(x, 2/3))
>
> Best,
> Serguei.
>
> Le 29/05/2017 à 10:02, peter dalgaard a écrit :
>> A usually trustworthy R correspondent posted a pure R implementation on SO at some point in his lost youth:
>>
>> https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line
>>
>> This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line().
>>
>> Notice, incidentally, that
>>
>>> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1))
>> Call:
>> line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1))
>>
>> Coefficients:
>> [1]  -0.9407   1.1948
>>
>> I.e., it is not likely an issue with exact integers or perfect fit.
>>
>> -pd
>>
>>
>>
>>> On 29 May 2017, at 07:21 , GlenB <glnbrntt at gmail.com> wrote:
>>>
>>>> Tukey divides the points into three groups, not the x and y values
>>> separately.
>>>
>>>> I'll try to get hold of the book for a direct quote, might take a couple
>>> of days.
>>>
>>> Ah well, I can't get it for a week. But the fact that it's often called
>>> Tukey's three group line (try a search on *tukey three group line* and
>>> you'll get plenty of hits) is pretty much a giveaway.
>>>
>>>
>>> On Mon, May 29, 2017 at 2:19 PM, GlenB <glnbrntt at gmail.com> wrote:
>>>
>>>> Tukey divides the points into three groups, not the x and y values
>>>> separately.
>>>>
>>>> I'll try to get hold of the book for a direct quote, might take a couple
>>>> of days.
>>>>
>>>>
>>>>
>>>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <murdoch.duncan at gmail.com>
>>>> wrote:
>>>>
>>>>> On 27/05/2017 9:28 PM, GlenB wrote:
>>>>>
>>>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2
>>>>>> or
>>>>>> 3
>>>>>>
>>>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives
>>>>>> intercept -1 and slope 1.2
>>>>>>
>>>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle
>>>>>> of
>>>>>> length 6, with four of every six correct.
>>>>>>
>>>>>> Bug has been present across many versions.
>>>>>>
>>>>>> The machine I just tried it on just now has R3.2.3:
>>>>>>
>>>>> If you look at the source (in src/library/stats/src/line.c), the
>>>>> explanation is clear:  the x value is chosen as the 1/6 quantile (according
>>>>> to a particular definition of quantile), and the y value is chosen as the
>>>>> median of the y values where x is less than or equal to the 1/3 quantile.
>>>>> Those are different definitions (though I think they would be
>>>>> asymptotically equivalent under pretty weak assumptions), so it's not
>>>>> surprising the x value doesn't correspond perfectly to the y value, and the
>>>>> line ends up "wrong".
>>>>>
>>>>> So is it a bug?  Well, that depends on Tukey's definition.  I don't have
>>>>> a copy of his book handy so I can't really say.  Maybe the R function is
>>>>> doing exactly what Tukey said it should, and that's not a bug.  Or maybe R
>>>>> is wrong.
>>>>>
>>>>> Duncan Murdoch
>>>>>
>>>>>
>>>     [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>

-- 
Serguei Sokol
Ingenieur de recherche INRA
Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys)

LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504
135 Avenue de Rangueil
31077 Toulouse Cedex 04

tel: +33 5 6155 9276
fax: +33 5 6704 8825
email: sokol at insa-toulouse.fr
http://metasys.insa-toulouse.fr
http://www.lisbp.fr

-------------- next part --------------
A non-text attachment was scrubbed...
Name: line.c.patch
Type: text/x-patch
Size: 2651 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20170529/1f1f2e6c/attachment.bin>


More information about the R-devel mailing list