[R] Boxplot philosophy {was "Boxplot in R"}

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Mon Jul 11 23:51:36 CEST 2005


On 11-Jul-05 Martin Maechler wrote:
>>>>>> "AdaiR" == Adaikalavan Ramasamy <ramasamy at cancer.org.uk>
>>>>>>     on Mon, 11 Jul 2005 03:04:44 +0100 writes:
> 
>     AdaiR> Just an addendum on the philosophical aspect of doing
>     AdaiR> this.  By selecting the 5% and 95% quantiles, you are
>     AdaiR> always going to get 10% of the data as "extreme" and
>     AdaiR> these points may not necessarily be outliers.  So when
>     AdaiR> you are comparing information from multiple columns
>     AdaiR> (i.e.  boxplots), it is harder to say which column
>     AdaiR> contains more extreme values compared to others etc.
> 
> Yes, indeed!
> 
> People {and software implementations} have several times provided
> differing definitions of how the boxplot whiskers should be defined.
> 
> I strongly believe that this is very often a very bad idea!!
> 
> A boxplot should be a universal means of communication and so one
> should be *VERY* reluctant to redefine the outliers.
> 
> I just find that Matlab (in their statistics toolbox)
> does *NOT* use such a silly 5% / 95% definition of the whiskers,
> at least not according to their documentation.
> That's very good (and I wonder where you, Larry, got the idea of
> the 5 / 95 %).
> Using such a fixed percentage is really a very inferior idea to
> John Tukey's definition {the one in use in all implementations
> of S (including R) probably for close to 20 years now}.
> 
> I see one flaw in Tukey's definition {which is shared of course
> by any silly "percentage" based ``outlier'' definition}:
> 
>    The non-dependency on the sample size.
> 
> If you have 1000 (or even many more) points,
> you'll get more and more `outliers' even for perfectly normal data.
> 
> But then, I assume John Tukey would have told us to do more
> sophisticated things {maybe things like the "violin plots"} than
> boxplot: if you have really very many data points, you may want
> to see more features -- or he would have agreed to use 
>    boxplot(*,  range = monotone_slowly_growing(n) )
> for largish sample sizes n.
> 
> Martin Maechler, ETH Zurich
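
To put rough numbers on the two whisker rules discussed above, here
is a minimal R sketch. It assumes exactly normal data and uses the
population quartiles, so the figures only approximate what a real
sample would give; the variable names are illustrative.

```r
# Share of an exactly normal population lying beyond Tukey's fences,
# i.e. beyond the quartiles -/+ 1.5 * IQR.
q3 <- qnorm(0.75)                      # upper quartile of N(0, 1)
upper_fence <- q3 + 1.5 * (2 * q3)     # the IQR of N(0, 1) is 2 * q3
p_out <- 2 * pnorm(upper_fence, lower.tail = FALSE)

# Expected number of flagged points under each rule.
n <- c(20, 100, 1000, 10000)
expected <- data.frame(
  n        = n,
  tukey    = n * p_out,   # grows in proportion to n
  pct_5_95 = n * 0.10     # 10% of the data by construction
)
print(expected)

# A simulated check: boxplot.stats() applies Tukey's rule with
# coef = 1.5 by default and returns the flagged points in $out.
set.seed(1)
x <- rnorm(1000)
length(boxplot.stats(x)$out)
```

Under these assumptions Tukey's rule flags roughly 0.7% of normal
points, so the number of `outliers' grows with n even though nothing
is wrong with the data, while the 5% / 95% rule flags 10% at any n.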

I happily agree with Martin's essay on Boxplot philosophy.

It would certainly confuse boxplot watchers if the interpretation
of what they saw had to vary from case to case. Having to give
careful (and necessarily detailed) explanations in the text of
what was different this time would not help much, and would
defeat the primary objective of the boxplot, which is to present
a summary of features of the data in a form which can be grasped
visually very quickly indeed.

I'm sure many of us have at times felt some frustration at the
rigidly precise numerical interpretations which Tukey imposed
on the elements of his many EDA techniques; but this did ensure
that the viewer really knew, at a glance, what he was looking at.

EDA brilliantly combined several aspects of "looking at data":
selection of features of the data; highly efficient encoding of
these, and of their inter-relationships, into a medium directly
adapted to visual perception; robustness (so that the perceptions
were not unstable with respect to wondering just what the underlying
distribution might be); accessibility (in the sense of being truly
understood) to non-theoreticians; and capacity to be implemented on
primitive information technology.

Indeed, one might say that the "core team" of EDA consists of the
techniques for which you need only pencil and paper.

Nevertheless, Tukey was no rigid dogmatist. His objective was
always to give a good representation of the data, and he would
happily shift his ground, or adapt a technique (albeit probably
giving it a different name), or devise a new one, if that would
be useful for the case in hand.
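
One possible reading of Martin's
   boxplot(*, range = monotone_slowly_growing(n))
is sketched below in R. Note that growing_range() is a hypothetical
helper, not anything provided by R, and that both the normal-theory
calibration and the default target of about one flagged point per
sample are assumptions made purely for illustration.

```r
# Hypothetical helper (not part of R): choose the whisker multiplier so
# that, for normal data, the expected number of flagged points stays
# near `target` whatever the sample size. It never drops below Tukey's
# usual 1.5, so small samples are treated exactly as before.
growing_range <- function(n, target = 1) {
  p    <- pmin(target / n, 1)               # per-point flagging probability
  z    <- qnorm(p / 2, lower.tail = FALSE)  # two-sided normal cutoff
  q3   <- qnorm(0.75)                       # upper quartile of N(0, 1)
  coef <- (z - q3) / (2 * q3)               # cutoff expressed in IQR units
  pmax(coef, 1.5)
}

# The multiplier for a range of sample sizes.
growing_range(c(10, 100, 1000, 1e4, 1e5, 1e6))

# Points flagged in one large normal sample under each multiplier.
set.seed(1)
x <- rnorm(10000)
length(boxplot.stats(x)$out)                                   # coef = 1.5
length(boxplot.stats(x, coef = growing_range(length(x)))$out)
```

The same multiplier can be handed to the plot itself, as in
boxplot(x, range = growing_range(length(x))). Any plot drawn this
way should say so, since it is no longer the standard boxplot.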

Best wishes to all,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 11-Jul-05                                       Time: 22:19:47
------------------------------ XFMail ------------------------------



