[R] Whiskers on the default boxplot {graphics}
Peter Ehlers
ehlers at ucalgary.ca
Wed May 12 21:06:59 CEST 2010
On 2010-05-12 10:51, Robert Baer wrote:
>
> ----- Original Message -----
> Fantastic!
>
> It would be great if the description could be modified to include the
> mysterious bit about the upper and lower bound whisker positions:
>
> upper whisker = min(max(x), Q_3 + 1.5 * IQR)
> lower whisker = max(min(x), Q_1 - 1.5 * IQR)
>
> -- snip --
> ----------------------
> NOT quite!
>
> The boxplot.stats help reads under the coef argument:
> "... the whiskers extend to the most extreme data point which is no more
> than coef times the length of the box away from the box."
>
>
> If there are outliers, and the most extreme data point within 1.5 * IQR
> of Q1 or Q3 is less than 1.5 IQRs away, the whisker may "end earlier" than
> 1.5 * IQR, and the data point at which it ends may NOT be max(x) or min(x).
>
But even this is not quite correct.

The help page (quoted above) is, as is so often the case,
quite precise: the *length of the box* is multiplied by 1.5,
not the *IQR*. The difference is probably insignificant in most
applications, but then this question was about the precise
definition of the whiskers.

The box length is defined by the hinges, for whose definition
it's probably easiest to look at the code in fivenum() which
is used by boxplot.stats(). (The relevant code consists of three
short lines.) For the calculation of the whisker extremes, one
can peruse the boxplot.stats() code, which also is quite brief.

Essentially, it determines which observations lie outside the
boundaries established by (lower hinge - 1.5 * boxlength) and
(upper hinge + 1.5 * boxlength) and then uses the range of
the remaining data values to determine the whisker extremes.
(I've assumed the default value of coef=1.5).
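
The procedure just described can be sketched as a small function.
This is only an illustration, not the boxplot.stats() source: the
name whisker_ends() and the data vector x are made up here, and
missing values are simply dropped.

```r
# Hypothetical helper (not part of R): whisker ends computed from the hinges.
whisker_ends <- function(x, coef = 1.5) {
  x <- x[!is.na(x)]
  five <- fivenum(x)                  # min, lower hinge, median, upper hinge, max
  box_length <- five[4] - five[2]     # hinge to hinge, not IQR(x)
  lower_fence <- five[2] - coef * box_length
  upper_fence <- five[4] + coef * box_length
  # Keep the observations that are not beyond the fences, then take their range.
  inside <- x >= lower_fence & x <= upper_fence
  range(x[inside])
}

x <- c(1, 2, 4, 8, 9, 10, 11, 15, 30, 40)   # made-up data, not from the post
whisker_ends(x)                    # hinges are 4 and 15, fences -12.5 and 31.5
boxplot.stats(x)$stats[c(1, 5)]    # should agree: whiskers end at 1 and 30
```

For this vector the value 40 lies beyond the upper fence, so the
upper whisker stops at 30, the largest remaining observation.
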
Here's an example:
set.seed(1)
y <- rexp(30, .02)
y <- sort(round(y))
fivenum(y)
#[1] 3 22 38 61 221
boxplot.stats(y)$stats
#[1] 3 22 38 61 118
# The hinges are 22, 61;
# The whisker extremes are 3, 118;
quantile(y, c(1,3)/4)
# 25% 75%
#23.25 60.50
# The hinges do not equal the quartiles.
# Upper cut-off ('fence'):
61 + 1.5 * (61 - 22)
#[1] 119.5
tail(y)
#[1] 70 94 118 145 198 221
# So 118 is the largest data value less than or equal to 119.5.
60.5 + 1.5 * IQR(y)
#[1] 116.375
# Using quartiles and the IQR would take the upper whisker to 94.
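
The same comparison can be repeated on a small vector whose values
can be checked by hand. This is a sketch only: the helper name
ends_for_box() and the data are invented for illustration, and
quantile() is used with its default type.

```r
# Hypothetical helper: whisker ends for a box with edges lo and hi.
ends_for_box <- function(x, lo, hi, coef = 1.5) {
  len <- hi - lo
  range(x[x >= lo - coef * len & x <= hi + coef * len])
}

x <- c(1, 2, 4, 8, 9, 10, 11, 15, 30, 40)      # made-up data
h <- fivenum(x)[c(2, 4)]                       # hinges: 4 and 15
q <- quantile(x, c(1, 3) / 4, names = FALSE)   # quartiles: 5 and 14

by_hinges    <- ends_for_box(x, h[1], h[2])    # upper fence 31.5, whiskers 1 and 30
by_quartiles <- ends_for_box(x, q[1], q[2])    # upper fence 27.5, whiskers 1 and 15
by_hinges
by_quartiles
```

Only the hinge-based result matches boxplot.stats(x)$stats[c(1, 5)];
with the quartiles, the observation at 30 would be flagged as an outlier.
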
--
Peter Ehlers