[R] boxplot notches
Christoph Scherber
Christoph.Scherber at uni-jena.de
Tue Mar 2 15:51:06 CET 2004
In McGill et al. (1978) there's a description of the calculation as
follows (p. 16):
"The widths [are] computed from the midspread or interquartile range (R)
of the data (...), and the number of observations (N) for each group.
The Gaussian-based asymptotic approximation (Kendall and Stuart 1967) of
the standard deviation s of the median (M) is given by
s = 1.25 R / (1.35 sqrt(N))
and can be shown to be reasonably broadly applicable to other
distributions (...)
The notch around each median can then be calculated as
M +- Cs,
where C is a constant. Should one desire a notch indicating 95 percent
confidence interval about each median, C = 1.96 would be used (...)
It can be shown that C=1.96 would only be appropriate if the standard
deviations of the two groups were vastly different (...) Thus, the
notches were computed as
M +- 1.7 (1.25 R / (1.35 sqrt(N)))"
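
As a rough illustration of the quoted formula, here is a minimal Python sketch that computes the notch limits M +- 1.7 (1.25 R / (1.35 sqrt(N))) for one group. The function name notch_limits and the choice of quartile rule (the "inclusive" method of Python's statistics.quantiles) are my own assumptions; McGill et al. and R's boxplot.stats may define the quartiles or hinges slightly differently, so results can differ a little for small samples.

```python
import math
import statistics


def notch_limits(values, c=1.7):
    """Return (lower, upper) notch limits around the median.

    Follows the formula quoted from McGill et al. (1978):
        s = 1.25 * R / (1.35 * sqrt(N)),  notch = M +- c * s
    The quartile rule used here is an assumption for illustration.
    """
    n = len(values)
    # Three cut points: lower quartile, median, upper quartile.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    r = q3 - q1  # midspread (interquartile range)
    s = 1.25 * r / (1.35 * math.sqrt(n))  # approx. std. deviation of the median
    return median - c * s, median + c * s


if __name__ == "__main__":
    lower, upper = notch_limits([1, 2, 3, 4, 5, 6, 7, 8, 9])
    print(f"notch: [{lower:.3f}, {upper:.3f}]")
```

Two groups whose notches do not overlap would then be read as having medians that differ at roughly the 5 percent level, which is the interpretation the paper attaches to C = 1.7.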
Hope this helps. Best regards
Chris.
REF:
McGill, R; Tukey, JW & Larsen, WA (1978): Variations of Box Plots. The
American Statistician, Vol. 32, No. 1, pp. 12-16.
Kendall, MG & Stuart, A (1967): The Advanced Theory of Statistics,
Vol. 1, 2nd ed., Ch. 14, New York, Hafner Publishing Co.
*****************************************
Michael Friendly wrote:
>>> I think John Tukey's idea was that this formula (or just the fact of
>>> using median and quartiles) is still often approximately correct
>>> for quite a few kinds of moderate contaminations...
>>
>> It may be approximately correct for the width of a CI (and when I
>> checked it was only approximately correct for a normal), but I would
>> seriously doubt if it were approximately correct for a significance
>> level of 5%.
>> Remember how fast the tails of the asymptotic normal distribution
>> decay: a 20% error turns 5% into 2%.
>>
>> BTW, if there is a precise reference for this it would be good to add it
>> to boxplot.stats.Rd, as the confidence limits are unexplained there.
>>
>
> The factor 1.58 for H-spr/\sqrt{n} comes from the product of three
> approximations, going from a 95% confidence interval for a difference
> in means to one for a difference in medians, using H-spr = IQR
> instead of the standard deviation:
>
>    H-spr/1.349 \approx \sigma in a N(0,1) distribution
>    \sqrt{\pi/2} \sigma/\sqrt{n} \approx std error of a median
>    1.7 \approx the average of 1.96 and 1.39 = 1.96/\sqrt{2},
>        factors for the standard error of the difference
>        between two means, in the cases where one variance is tiny,
>        and where both are equal.
>
> so that 1.7 \sqrt{\pi/2} / 1.349 \approx 1.58.
>
> I believe this is explained in
>
> @Article{McGill-etal:78,
> author = "R. McGill and J. W. Tukey and W. Larsen",
> year = "1978",
> title = "Variations of Box Plots",
> journal = TAS,
> volume = "32",
> pages = "12--16",
> }
>
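
The arithmetic in the quoted exchange can be checked numerically. The following is a small Python sketch, not taken from any of the posts; all variable names are mine. It reproduces the factor 1.58 from the three approximations, and it also looks at the remark that "a 20% error turns 5% into 2%", under my assumption that this means a critical value 20% larger than 1.96 in a two-sided normal test.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

# IQR of a standard normal, i.e. H-spr / sigma (about 1.349).
iqr_per_sigma = std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25)

# Asymptotic std. error of the median is sqrt(pi/2) * sigma / sqrt(n).
median_se_factor = math.sqrt(math.pi / 2)

# Average of the two-sample factors: one variance tiny (1.96) versus
# both variances equal (1.96 / sqrt(2)); McGill et al. round this to 1.7.
z95 = std_normal.inv_cdf(0.975)
c_average = (z95 + z95 / math.sqrt(2)) / 2

# Product of the three approximations, using the rounded C = 1.7.
notch_factor = 1.7 * median_se_factor / iqr_per_sigma


def two_sided_p(z):
    """Two-sided tail probability of a standard normal beyond +-z."""
    return 2 * (1 - std_normal.cdf(z))


# Assumed reading of "a 20% error": the critical value is 1.2 * 1.96.
p_inflated = two_sided_p(1.2 * z95)

if __name__ == "__main__":
    print(f"IQR / sigma         = {iqr_per_sigma:.4f}")
    print(f"sqrt(pi/2)          = {median_se_factor:.4f}")
    print(f"average C           = {c_average:.4f}")
    print(f"notch factor        = {notch_factor:.4f}")
    print(f"p at 1.2 * 1.96     = {p_inflated:.4f}")
```

With the rounded constants used in the McGill et al. quotation (1.25 and 1.35) the product comes out marginally smaller than with the unrounded ones, which may explain small discrepancies between sources.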