[R] Boxplot philosophy {was "Boxplot in R"}
Spencer Graves
spencer.graves at pdf.com
Tue Jul 12 04:08:39 CEST 2005
I'll bite: How does one detect bimodality from a boxplot?
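
To make the question concrete, here is a minimal sketch in base R. The samples, seed and parameters are made up for illustration. A boxplot is drawn from the five-number summary, so a flat sample and a two-clump sample with similar quartiles should come out looking much alike, while a density estimate separates them.

```r
# Sketch: a unimodal and a bimodal sample with similar five-number summaries.
set.seed(1)                                   # arbitrary seed, for reproducibility
n <- 1000
unimodal <- runif(n, -2.4, 2.4)               # flat, no separate clumps
bimodal  <- c(rnorm(n / 2, -1.2, 0.4),        # two well-separated clumps
              rnorm(n / 2,  1.2, 0.4))

# The five-number summaries that drive the boxes and whiskers.
print(rbind(unimodal = fivenum(unimodal),
            bimodal  = fivenum(bimodal)))

op <- par(mfrow = c(1, 2))
boxplot(list(unimodal = unimodal, bimodal = bimodal), main = "Boxplots")
plot(density(bimodal), main = "Density of 'bimodal'")   # shows the two modes
par(op)
```

The only hint left in the boxplot is a box that is long relative to its whiskers, and the flat sample shows that too.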
spencer graves
Berton Gunter wrote:
> FWIW:
>
> I have been an enthusiastic user of boxplots for decades. Of course, the
> issue of how to handle the whiskers ("outliers") is a valid one, and indeed
> sample size related. Dogma is always dangerous. I got to know John Tukey
> somewhat (I used to chauffeur him to and from meetings with a group of Merck
> statisticians), and I, too, think he would have been the first to agree that
> some flexibility here is wise.
>
> HOWEVER, the chief advantage of boxplots is their simplicity at displaying
> simultaneously and easily **several** important aspects of the data, of
> which outliers are probably the most problematic (as they often result in
> severe distortion of the plots without careful scaling). Even with dozens of
> boxplots, center, scale, and skewness are easy to discern and compare. I
> think this would NOT be true of "violin" plots and other more complex
> versions -- simplicity can be a virtue.
>
> Finally, a tidbit for boxplot aficionados: how does one detect bimodality
> from a boxplot?
>
> -- Bert Gunter
> Genentech Non-Clinical Statistics
> South San Francisco, CA
>
> "The business of the statistician is to catalyze the scientific learning
> process." - George E. P. Box
>
>
>
>
>>-----Original Message-----
>>From: r-help-bounces at stat.math.ethz.ch
>>[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Ted Harding
>>Sent: Monday, July 11, 2005 2:52 PM
>>To: r-help at stat.math.ethz.ch
>>Subject: Re: [R] Boxplot philosophy {was "Boxplot in R"}
>>
>>On 11-Jul-05 Martin Maechler wrote:
>>
>>>>>>>>"AdaiR" == Adaikalavan Ramasamy <ramasamy at cancer.org.uk>
>>>>>>>> on Mon, 11 Jul 2005 03:04:44 +0100 writes:
>>>
>>> AdaiR> Just an addendum on the philosophical aspect of doing
>>> AdaiR> this. By selecting the 5% and 95% quantiles, you are
>>> AdaiR> always going to get 10% of the data as "extreme" and
>>> AdaiR> these points may not necessarily be outliers. So when
>>> AdaiR> you are comparing information from multiple columns
>>> AdaiR> (i.e. boxplots), it is harder to say which column
>>> AdaiR> contains more extreme values compared to others etc.
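
The point above can be checked with a short sketch in base R. The two samples and the seed are assumptions chosen only for illustration. Quantile-based whiskers flag about 10% of any sample, while the default 1.5 * IQR fences flag a share that depends on the tails of the data.

```r
# Sketch: share of points flagged by 5%/95% whiskers versus Tukey's fences.
set.seed(2)                                   # arbitrary seed
n <- 2000
samples <- list(normal = rnorm(n),            # light tails
                t3     = rt(n, df = 3))       # heavy tails

# Share of points outside the 5% and 95% sample quantiles.
share_pct <- sapply(samples, function(x) {
  q <- quantile(x, c(0.05, 0.95))
  mean(x < q[1] | x > q[2])
})

# Share of points boxplot() would draw individually (default range = 1.5).
share_tukey <- sapply(samples, function(x) {
  length(boxplot.stats(x)$out) / length(x)
})

print(rbind(percent_whiskers = share_pct, tukey_fences = share_tukey))
```

With percentage whiskers both columns look equally "extreme"; with the default fences the heavy-tailed column should stand out.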
>>>
>>>Yes, indeed!
>>>
>>>People {and software implementations} have several times provided
>>>differing definitions of how the boxplot whiskers should be defined.
>>>
>>>I strongly believe that this is very often a very bad idea!!
>>>
>>>A boxplot should be a universal means of communication and so one
>>>should be *VERY* reluctant to redefine the outliers.
>>>
>>>I just find that Matlab (in their statistics toolbox)
>>>does *NOT* use such a silly 5% / 95% definition of the whiskers,
>>>at least not according to their documentation.
>>>That's very good (and I wonder where you, Larry, got the idea of
>>>the 5 / 95 %).
>>>Using such a fixed percentage is really a very inferior idea to
>>>John Tukey's definition {the one in use in all implementations
>>>of S (including R) probably for close to 20 years now}.
>>>
>>>I see one flaw in Tukey's definition {which is shared of course
>>>by any silly "percentage" based ``outlier'' definition}:
>>>
>>> The non-dependency on the sample size.
>>>
>>>If you have 1000 (or even many more) points,
>>>you'll get more and more `outliers' even for perfectly normal data.
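
A rough calculation of this effect, as a sketch in base R. It uses population quartiles of the standard normal, so it is only an approximation for real samples, and small samples behave somewhat differently.

```r
# Sketch: for exactly normal data the share beyond Tukey's fences is fixed,
# so the expected number of flagged points grows in proportion to n.
k  <- 1.5                                     # boxplot()'s default 'range'
q3 <- qnorm(0.75)                             # upper quartile of N(0, 1)
upper_fence <- q3 + k * (2 * q3)              # Q3 + 1.5 * IQR
p_out <- 2 * pnorm(upper_fence, lower.tail = FALSE)   # both tails together

n <- c(10, 100, 1000, 10000, 1e5)
print(data.frame(n = n, expected_flagged = n * p_out))

# A simulated check for one sample size (seed is arbitrary).
set.seed(3)
print(length(boxplot.stats(rnorm(10000))$out))
```

The flagged share is a little under 1%, so a handful of points are expected at n = 1000 and several hundred at n = 1e5.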
>>>
>>>But then, I assume John Tukey would have told us to do more
>>>sophisticated things {maybe things like the "violin plots"} than
>>>a boxplot if you have really very many data points, since you may
>>>then want to see more features -- or he would have agreed to use
>>> boxplot(*, range = monotone_slowly_growing(n) )
>>>for largish sample sizes n.
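
One way to read the placeholder above, as a sketch only. `slow_range` is a hypothetical helper, not part of R and not something proposed in this thread, and its parameters `m` and `min_range` are assumptions. It picks the multiplier so that roughly `m` points would be flagged in normal data, and never drops below the usual 1.5.

```r
# Hypothetical helper (not part of R): a whisker multiplier that grows
# slowly with n, aiming at about 'm' flagged points for normal data.
slow_range <- function(n, m = 1, min_range = 1.5) {
  q3 <- qnorm(0.75)
  z  <- qnorm(m / (2 * n), lower.tail = FALSE)   # fence as a normal quantile
  pmax(min_range, (z / q3 - 1) / 2)              # solve q3 + k * 2 * q3 = z for k
}

print(slow_range(c(50, 100, 1000, 1e4, 1e6)))

set.seed(4)                                      # arbitrary seed
x <- rnorm(5000)
op <- par(mfrow = c(1, 2))
boxplot(x, main = "range = 1.5")
boxplot(x, range = slow_range(length(x)), main = "range grows with n")
par(op)
```

The cost is the one raised elsewhere in this thread: the reader then has to be told that the whiskers no longer follow the standard definition.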
>>>
>>>Martin Maechler, ETH Zurich
>>
>>I happily agree with Martin's essay on Boxplot philosophy.
>>
>>It would certainly confuse boxplot watchers if the interpretation
>>of what they saw had to vary from case to case. The fact that
>>careful (and necessarily detailed) explanations of what was
>>different this time would be necessary in the text would not
>>help much, and would defeat the primary objective of the boxplot
>>which is to present a summary of features of the data in a form
>>which can be grasped visually very quickly indeed.
>>
>>I'm sure many of us have at times felt some frustration at the
>>rigidly precise numerical interpretations which Tukey imposed
>>on the elements of his many EDA techniques; but this did ensure
>>that the viewer really knew, at a glance, what he was looking at.
>>
>>EDA brilliantly combined several aspects of "looking at data":
>>selection of features of the data; highly efficient encoding of
>>these, and of their inter-relationships, into a medium directly
>>adapted to visual perception; robustness (so that the perceptions
>>were not unstable with respect to wondering just what the underlying
>>distribution might be); accessibility (in the sense of being truly
>>understood) to non-theoreticians; and capacity to be implemented on
>>primitive information technology.
>>
>>Indeed, one might say that the "core team" of EDA consists of the
>>techniques for which you need only pencil and paper.
>>
>>Nevertheless, Tukey was no rigid dogmatist. His objective was
>>always to give a good representation of the data, and he would
>>happily shift his ground, or adapt a technique (albeit probably
>>giving it a different name), or devise a new one, if that would
>>be useful for the case in hand.
>>
>>Best wishes to all,
>>Ted.
>>
>>
>>--------------------------------------------------------------------
>>E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
>>Fax-to-email: +44 (0)870 094 0861
>>Date: 11-Jul-05 Time: 22:19:47
>>------------------------------ XFMail ------------------------------
>>
>>______________________________________________
>>R-help at stat.math.ethz.ch mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide!
>>http://www.R-project.org/posting-guide.html
>>
--
Spencer Graves, PhD
Senior Development Engineer
PDF Solutions, Inc.
333 West San Carlos Street Suite 700
San Jose, CA 95110, USA
spencer.graves at pdf.com
www.pdf.com <http://www.pdf.com>
Tel: 408-938-4420
Fax: 408-280-7915