[R] Boxplot philosophy {was "Boxplot in R"}

Martin Maechler maechler at stat.math.ethz.ch
Mon Jul 11 14:36:35 CEST 2005


>>>>> "AdaiR" == Adaikalavan Ramasamy <ramasamy at cancer.org.uk>
>>>>>     on Mon, 11 Jul 2005 03:04:44 +0100 writes:

    AdaiR> Just an addendum on the philosophical aspect of doing
    AdaiR> this.  By selecting the 5% and 95% quantiles, you are
    AdaiR> always going to get 10% of the data as "extreme" and
    AdaiR> these points may not necessarily be outliers.  So when
    AdaiR> you are comparing information from multiple columns
    AdaiR> (i.e.  boxplots), it is harder to say which column
    AdaiR> contains more extreme values compared to others etc.

Yes, indeed!

People {and software implementations} have several times proposed
differing definitions of the boxplot whiskers.

I strongly believe that this is very often a very bad idea!!

A boxplot should be a universal means of communication and so one
should be *VERY* reluctant to redefine the outliers.

I just found that Matlab (in their statistics toolbox)
does *NOT* use such a silly 5% / 95% definition of the whiskers,
at least not according to their documentation.
That's very good (and I wonder where you, Larry, got the idea of
the 5 / 95 %).
Using such a fixed percentage is really a very inferior idea to
John Tukey's definition {the one in use in all implementations
of S (including R) probably for close to 20 years now}.
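
As a rough illustration of the difference (a sketch, not part of the
original message; the simulated data and the helper names
flag_by_percentile() and flag_by_tukey() are assumptions made for the
example), the following compares the fixed 5% / 95% rule with Tukey's
rule as implemented in boxplot.stats(), whose default coef = 1.5 ends
the whiskers at the most extreme points lying within 1.5 box lengths
of the box:

```r
set.seed(1)
x_clean  <- rnorm(200)                    # no gross outliers
x_contam <- c(rnorm(190), rnorm(10, 8))   # 10 clearly shifted points

# Hypothetical helper: flag everything outside the 5% / 95% quantiles
flag_by_percentile <- function(x) {
  q <- quantile(x, c(0.05, 0.95))
  x[x < q[1] | x > q[2]]
}

# Tukey's rule, via the `out` component of boxplot.stats()
flag_by_tukey <- function(x) boxplot.stats(x)$out

# Number of flagged points per rule (rows) and per sample (columns)
n_flagged <- sapply(list(clean = x_clean, contaminated = x_contam),
                    function(x) c(percentile = length(flag_by_percentile(x)),
                                  tukey      = length(flag_by_tukey(x))))
print(n_flagged)
```

The percentile rule flags about 10% of the points in both samples, so
it cannot distinguish them, whereas the count from Tukey's rule
depends on the data.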

I see one flaw in Tukey's definition {which is shared of course
by any silly "percentage" based ``outlier'' definition}:

   The non-dependency on the sample size.

If you have 1000 (or even many more) points,
you'll get more and more `outliers' even for perfectly normal data.
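
To put a number on this, here is a small sketch (an illustration
only; it uses the population quartiles of the standard normal, so it
is a large-sample approximation) of the chance that a normal
observation falls outside the default fences, and of the expected
number of flagged points for several sample sizes:

```r
# Probability that a N(0,1) observation falls outside Tukey's fences,
# using the population quartiles (large-sample approximation)
q3    <- qnorm(0.75)            # upper quartile, about 0.674
fence <- q3 + 1.5 * (2 * q3)    # upper fence = Q3 + 1.5 * IQR
p_out <- 2 * pnorm(-fence)      # both tails, about 0.007

n <- c(100, 1000, 10000, 100000)
expected_flagged <- n * p_out   # grows linearly in n
print(data.frame(n = n, expected_flagged = round(expected_flagged, 1)))
```

The probability is fixed at roughly 0.7%, so the expected number of
flagged points is proportional to n.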

But then, I assume John Tukey would have told us to do more
sophisticated things {maybe things like the "violin plots"} than
boxplots: if you have really very many data points, you may want
to see more features.  Or he would have agreed to use
   boxplot(*,  range = monotone_slowly_growing(n) )
for largish sample sizes n.
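
One possible reading of such a function, purely as a sketch: the name
slowly_growing_range(), its arguments n0 and base, and the calibration
rule are all assumptions, not something proposed in this thread.  The
idea is to keep the expected number of flagged points for normal data
at the level that the default range = 1.5 gives for n0 observations,
which makes the coefficient grow only very slowly with n:

```r
# Hypothetical helper: a whisker coefficient that grows slowly with n.
# Assumption: for n > n0, hold the expected number of flagged points under
# normality at the value that range = base gives for n0 observations.
slowly_growing_range <- function(n, n0 = 100, base = 1.5) {
  q3 <- qnorm(0.75)
  p0 <- 2 * pnorm(-(1 + 2 * base) * q3)   # flag probability at range = base
  p  <- p0 * pmin(1, n0 / n)              # shrink it once n exceeds n0
  z  <- qnorm(p / 2, lower.tail = FALSE)  # fence position for that probability
  (z / q3 - 1) / 2                        # solve z = (1 + 2 * range) * q3
}

ranges <- slowly_growing_range(c(50, 100, 1000, 1e4, 1e6))
print(round(ranges, 2))

# Compare the number of flagged points for a large normal sample
set.seed(1)
x <- rnorm(1e4)
r <- slowly_growing_range(length(x))
n_default <- length(boxplot.stats(x)$out)            # coef = 1.5
n_grown   <- length(boxplot.stats(x, coef = r)$out)  # sample-size dependent
print(c(default = n_default, grown = n_grown))
```

The result can be passed on directly, e.g.
boxplot(x, range = slowly_growing_range(length(x))).  Other slowly
growing choices would serve equally well; this one is calibrated to
normal data only.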

Martin Maechler, ETH Zurich




    AdaiR> Regards, Adai

    AdaiR> On Sun, 2005-07-10 at 18:10 -0500, Larry Xie wrote:
    >> I am trying to draw a plot like Matlab does: 
    >> 
    >> The upper extreme whisker represents 95% of the data;
    >> The upper hinge represents 75% of the data;
    >> The median represents 50% of the data;
    >> The lower hinge represents 25% of the data;
    >> The lower extreme whisker represents 5% of the data.
    >> 
    >> It looks like:
    >> 
    >> ---         95%
    >> |
    >> |
    >> -------       75%
    >> |     |
    >> |-----|       50%
    >> |     |
    >> |     |
    >> -------       25%
    >> |
    >> ---         5%
    >> 
    >> Can anyone give me some hints as to how to draw a boxplot like that?
    >> What function does it? I tried boxplot() but couldn't figure it out.
    >> If it's boxplot(), what arguments should I pass to the function? Thank
    >> you for your help. I'd appreciate it.



