[R] Boxplot philosophy {was "Boxplot in R"}
Martin Maechler
maechler at stat.math.ethz.ch
Mon Jul 11 14:36:35 CEST 2005
>>>>> "AdaiR" == Adaikalavan Ramasamy <ramasamy at cancer.org.uk>
>>>>> on Mon, 11 Jul 2005 03:04:44 +0100 writes:
AdaiR> Just an addendum on the philosophical aspect of doing
AdaiR> this. By selecting the 5% and 95% quantiles, you are
AdaiR> always going to get 10% of the data as "extreme" and
AdaiR> these points may not necessarily be outliers. So when
AdaiR> you are comparing information from multiple columns
AdaiR> (i.e. boxplots), it is harder to say which column
AdaiR> contains more extreme values compared to others etc.
Yes, indeed!
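A small sketch of Adai's point, using simulated data (the helper name frac_flagged and the two samples are made up for illustration): with 5% / 95% whiskers, about 10% of the points end up beyond the whiskers whether or not the data contain anything unusual.

```r
# Fraction of points lying outside the 5% and 95% sample quantiles.
# frac_flagged() is a hypothetical helper, not part of R.
frac_flagged <- function(x) {
  q <- quantile(x, c(0.05, 0.95))
  mean(x < q[1] | x > q[2])
}

set.seed(1)
x_norm  <- rnorm(1000)        # well-behaved data, no real outliers
x_heavy <- rt(1000, df = 2)   # heavy-tailed data with genuinely extreme values

# Both fractions come out at about 0.10, so they say nothing about
# which sample actually has the more extreme values.
c(normal = frac_flagged(x_norm), heavy = frac_flagged(x_heavy))
```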
People {and software implementations} have several times come up
with differing definitions of the boxplot whiskers.
I strongly believe that this is very often a very bad idea!!
A boxplot should be a universal means of communication and so one
should be *VERY* reluctant to redefine the outliers.
I have just found that Matlab (in their statistics toolbox)
does *NOT* use such a silly 5% / 95% definition of the whiskers,
at least not according to their documentation.
That's very good (and I wonder where you, Larry, got the idea of
the 5 / 95 %).
Using such a fixed percentage is really a very inferior idea to
John Tukey's definition {the one in use in all implementations
of S (including R) probably for close to 20 years now}.
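For reference, a minimal sketch of what that definition amounts to in R: with the default range = 1.5, a point is drawn as an outlier if it lies more than 1.5 times the hinge spread beyond a hinge, and boxplot.stats() does this computation. The data below are made up for illustration.

```r
set.seed(2)
x <- c(rnorm(50), 6)   # 50 normal points plus one artificial gross outlier

# What boxplot() uses by default (range = 1.5 corresponds to coef = 1.5)
bs <- boxplot.stats(x, coef = 1.5)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points drawn individually as outliers

# The same rule by hand, from Tukey's hinges as returned by fivenum()
h <- fivenum(x)[c(2, 4)]
fences <- h + c(-1.5, 1.5) * diff(h)
out_by_hand <- x[x < fences[1] | x > fences[2]]
```

Here out_by_hand contains the same points as bs$out, including the artificial value 6.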
I see one flaw in Tukey's definition {which is shared of course
by any silly "percentage" based ``outlier'' definition}:
The non-dependency on the sample size.
If you have 1000 (or even many more) points,
you'll get more and more `outliers' even for perfectly normal data.
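To put a rough number on this: for exactly normal data, the population version of the 1.5 rule puts a bit under 1% of the distribution outside the fences, so the expected count of flagged points grows about linearly in n. A short sketch (the sample sizes are arbitrary, and the simulated counts will vary with the seed):

```r
# Probability that a normal observation falls outside Tukey's fences,
# computed from the population quartiles
q <- qnorm(c(0.25, 0.75))
fences <- q + c(-1.5, 1.5) * diff(q)
p_out <- pnorm(fences[1]) + pnorm(fences[2], lower.tail = FALSE)
p_out   # roughly 0.007

# Number of points flagged in simulated, perfectly normal samples
set.seed(3)
n <- c(100, 1000, 10000, 100000)
n_out <- sapply(n, function(k) length(boxplot.stats(rnorm(k))$out))
rbind(n = n, flagged = n_out, expected = round(n * p_out, 1))
```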
But then, I assume John Tukey would have told us to do more
sophisticated things {maybe things like the "violin plots"} than
a boxplot: if you have really very many data points, you may want
to see more features. Or he would have agreed to use
boxplot(*, range = monotone_slowly_growing(n) )
for largish sample sizes n.
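Just to make the placeholder concrete, here is one possible choice, and it is purely an assumption of this sketch, not anything Tukey proposed: keep range = 1.5 up to a reference size n0, and beyond that let the range grow so that the expected number of flagged points for normal data stays what it was at n0. The argument n0 and its default of 100 are arbitrary.

```r
# Hypothetical definition of the placeholder monotone_slowly_growing(n):
# hold the expected number of flagged points (under normality) at its
# value for n = n0 with range = 1.5.
monotone_slowly_growing <- function(n, n0 = 100) {
  z  <- qnorm(0.75)                      # upper quartile of N(0,1)
  p0 <- 2 * pnorm(-(1 + 2 * 1.5) * z)    # P(outside fences) at range = 1.5
  target <- n0 * p0                      # expected flagged count to hold fixed
  k <- (-qnorm(target / (2 * n)) / z - 1) / 2
  pmax(1.5, k)                           # never go below the usual 1.5
}

monotone_slowly_growing(c(100, 1e4, 1e6))   # grows only slowly with n

set.seed(4)
x <- rnorm(1e5)
b_default  <- boxplot(x, plot = FALSE)
b_adaptive <- boxplot(x, range = monotone_slowly_growing(length(x)),
                      plot = FALSE)
c(default = length(b_default$out), adaptive = length(b_adaptive$out))
```

The resulting range grows roughly like sqrt(log(n)), which is "slowly" in the sense above; other monotone choices would do as well.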
Martin Maechler, ETH Zurich
AdaiR> Regards, Adai
AdaiR> On Sun, 2005-07-10 at 18:10 -0500, Larry Xie wrote:
>> I am trying to draw a plot like Matlab does:
>>
>> The upper extreme whisker represents 95% of the data;
>> The upper hinge represents 75% of the data;
>> The median represents 50% of the data;
>> The lower hinge represents 25% of the data;
>> The lower extreme whisker represents 5% of the data.
>>
>> It looks like:
>>
>>     ---     95%
>>      |
>>      |
>>   -------   75%
>>   |     |
>>   |-----|   50%
>>   |     |
>>   |     |
>>   -------   25%
>>      |
>>     ---     5%
>>
>> Can anyone give me some hints as to how to draw a boxplot like that?
>> What function does it? I tried boxplot() but couldn't figure it out.
>> If it's boxplot(), what arguments should I pass to the function? Thank
>> you for your help. I'd appreciate it.