# [R] understanding patterns in categorical vs. continuous data

Liaw, Andy andy_liaw at merck.com
Fri Jan 27 04:07:42 CET 2006

```
From: Dave Roberts
>
> You might prefer boxplot(insolation~veg_type) as a graphic. That will
> give you quantiles.  To get the actual numeric values you could
>
> for (i in levels(veg_type)) {
>     print(i)
>     # print() is needed here: results are not auto-printed inside a loop
>     print(quantile(insolation[veg_type==i]))
> }
>
> see ?quantile for more help.

If you want the five-number summaries plotted in the boxplots, just look at
the returned object of boxplot():

> g <- factor(rep(1:3, 10))
> y <- rnorm(30)
> res <- boxplot(y ~ g)
> str(res)
List of 6
 $ stats: num [1:5, 1:3] -1.135 -0.757 -0.536  0.499  0.996 ...
 $ n    : num [1:3] 10 10 10
 $ conf : num [1:2, 1:3] -1.1639  0.0918 -0.5208  1.6546 -1.2487 ...
 $ out  : num(0)
 $ group: num(0)
 $ names: chr [1:3] "1" "2" "3"
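
# A minimal sketch of pulling the plotted summaries out of the returned
# object. Assumptions not in the message above: the seed, plot = FALSE
# (to skip drawing), and the row labels, which boxplot() does not supply.
set.seed(1)                          # arbitrary seed, for reproducibility only
g <- factor(rep(1:3, 10))
y <- rnorm(30)
res <- boxplot(y ~ g, plot = FALSE)  # same list as above, without the plot
stats <- res$stats                   # one column per group, in res$names order
dimnames(stats) <- list(c("lower whisker", "lower hinge", "median",
                          "upper hinge", "upper whisker"),
                        res$names)
print(stats)
# The first and last rows are whisker ends, so they equal the group
# minimum and maximum only when res$out is empty for that group.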

If you just want to compute the summaries without the boxplots, use
fivenum():

> tapply(y, g, fivenum)
$"1"
[1] -1.1352456 -0.7571895 -0.5360496  0.4994445  0.9956749

$"2"
[1] -1.1408493 -0.3751730  0.5668747  1.8018146  2.0019303

$"3"
[1] -2.2309983 -0.9333305 -0.3402786  0.8849042  0.9833057
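
# If one table is easier to read than a list, a possible variant (a sketch,
# not part of the original reply) is sapply() over split(), which gives a
# 5 x 3 matrix with one column per level of g. The seed and the row labels
# are assumptions added for illustration.
set.seed(1)
g <- factor(rep(1:3, 10))
y <- rnorm(30)
five <- sapply(split(y, g), fivenum)
rownames(five) <- c("min", "lower hinge", "median", "upper hinge", "max")
print(five)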

... and if you really want the quantiles, you can do that, too:

> tapply(y, g, quantile)
$"1"
        0%        25%        50%        75%       100%
-1.1352456 -0.7391977 -0.5360496  0.3378861  0.9956749

$"2"
        0%        25%        50%        75%       100%
-1.1408493 -0.3039648  0.5668747  1.6669879  2.0019303

$"3"
        0%        25%        50%        75%       100%
-2.2309983 -0.8389260 -0.3402786  0.6746950  0.9833057

... but note how the quartiles and hinges are not necessarily the same.
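
# A small sketch of why they can differ, using a made-up vector x (not the
# data above). fivenum() returns Tukey's hinges, which for n = 10 are the
# medians of the lower and upper halves of the sorted data; quantile() with
# its default type = 7 interpolates linearly between order statistics.
x <- 1:10
h <- fivenum(x)    # 1.0 3.0 5.5 8.0 10.0   (hinges: 3 and 8)
q <- quantile(x)   # 1.00 3.25 5.50 7.75 10.00   (quartiles: 3.25 and 7.75)
print(h)
print(q)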

Andy

> Dylan Beaudette wrote:
> > Greetings,
> >
> > I have a set of bivariate data: one variable (vegetation type) which is
> > categorical, and one (computed annual insolation) which is continuous.
> > Plotting veg_type ~ insolation produces a nice overview of the patterns
> > that I can see in the source data. However, due to the large number of
> > samples (1,000), and the apparent "spread" in the distribution of a single
> > vegetation type over a range of insolation values, I am having a hard time
> > quantitatively describing the relationship between the two variables.
> >
> > Here is a link to a sample graph:
> > http://casoilresource.lawr.ucdavis.edu/drupal/node/162
> >
> > Since the data along each vegetation type "line" is not a distribution in
> > the traditional sense, I am having problems applying descriptive
> > statistical methods. Conceptually, I would like to somehow describe the
> > variation with insolation, along each vegetation type "line".
> >
> > Any guidance, or suggested reading material would be greatly appreciated.
> >
> >
>
>
> --
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> David W. Roberts                              office 406-994-4548
> 406-994-3190
> Department of Ecology                         email droberts at montana.edu
> Montana State University
> Bozeman, MT 59717-3460
>