[R] understanding patterns in categorical vs. continuous data

Fri Jan 27 04:07:42 CET 2006

From: Dave Roberts
> 
> You might prefer boxplot(insolation~veg_type) as a graphic.  That will
> give you quantiles.  To get the actual numeric values you could
> 
> for (i in levels(veg_type)) {
>     print(i)
>     print(quantile(insolation[veg_type==i]))
> }
> 
> see ?quantile for more help.

If you want the five-number summaries plotted in the boxplots, just look at
the $stats component of the object returned by boxplot():

> g <- factor(rep(1:3, 10))
> y <- rnorm(30)
> res <- boxplot(y ~ g)
> str(res)
List of 6
 $ stats: num [1:5, 1:3] -1.135 -0.757 -0.536  0.499  0.996 ...
 $ n    : num [1:3] 10 10 10
 $ conf : num [1:2, 1:3] -1.1639  0.0918 -0.5208  1.6546 -1.2487 ...
 $ out  : num(0) 
 $ group: num(0) 
 $ names: chr [1:3] "1" "2" "3"
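
A minimal sketch of pulling those numbers out without drawing anything,
assuming the same kind of simulated data as above. The seed and the row
labels are my own additions for illustration; boxplot() itself returns an
unlabelled matrix.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
g <- factor(rep(1:3, 10))
y <- rnorm(30)

# plot = FALSE returns the same list without opening a graphics device
res <- boxplot(y ~ g, plot = FALSE)

# One column per group, five rows per column.
# The row labels below are illustrative; they are not set by boxplot().
stats <- res$stats
dimnames(stats) <- list(
  c("lower whisker", "lower hinge", "median", "upper hinge", "upper whisker"),
  res$names
)
print(stats)
```

Rows 1 and 5 are the whisker ends, so they match the minimum and maximum
only for groups with no points flagged as outliers (those end up in
res$out instead).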

If you just want to compute the summaries without the boxplots, use
fivenum():

> tapply(y, g, fivenum)
$"1"
[1] -1.1352456 -0.7571895 -0.5360496  0.4994445  0.9956749

$"2"
[1] -1.1408493 -0.3751730  0.5668747  1.8018146  2.0019303

$"3"
[1] -2.2309983 -0.9333305 -0.3402786  0.8849042  0.9833057
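
If a single matrix is handier than a list, one possible variant is to
split the data by group and let sapply() do the simplification. This is
only a sketch on simulated data; the seed and the row labels are my own
choices, since fivenum() returns an unnamed vector.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
g <- factor(rep(1:3, 10))
y <- rnorm(30)

# split() gives one numeric vector per level of g;
# sapply() binds the five numbers for each group into a 5 x 3 matrix
fn <- sapply(split(y, g), fivenum)

# Illustrative labels for the five numbers returned by fivenum()
rownames(fn) <- c("min", "lower hinge", "median", "upper hinge", "max")
print(fn)
```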

... and if you really want the quantiles, you can do that, too:

> tapply(y, g, quantile)
$"1"
        0%        25%        50%        75%       100% 
-1.1352456 -0.7391977 -0.5360496  0.3378861  0.9956749 

$"2"
        0%        25%        50%        75%       100% 
-1.1408493 -0.3039648  0.5668747  1.6669879  2.0019303 

$"3"
        0%        25%        50%        75%       100% 
-2.2309983 -0.8389260 -0.3402786  0.6746950  0.9833057 

... but note how the quartiles and hinges are not necessarily the same.
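
A small sketch of that difference on toy vectors (the vectors are made
up for illustration). As ?boxplot.stats notes, the hinges equal the
quartiles for odd n and differ for even n, which is why the groups of 10
above disagree.

```r
x_even <- 1:10   # even n: hinges and default quartiles differ
x_odd  <- 1:9    # odd n: they coincide

# Hinges are elements 2 and 4 of the five-number summary
hinges_even <- fivenum(x_even)[c(2, 4)]
hinges_odd  <- fivenum(x_odd)[c(2, 4)]

# Default quantile() (type = 7) interpolates linearly between order statistics
quartiles_even <- quantile(x_even, c(0.25, 0.75), names = FALSE)
quartiles_odd  <- quantile(x_odd,  c(0.25, 0.75), names = FALSE)

print(hinges_even)     # 3 8
print(quartiles_even)  # 3.25 7.75
print(hinges_odd)      # 3 7
print(quartiles_odd)   # 3 7
```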

Andy

> Dylan Beaudette wrote:
> > Greetings,
> > 
> > I have a set of bivariate data: one variable (vegetation type) which is
> > categorical, and one (computed annual insolation) which is continuous.
> > Plotting veg_type ~ insolation produces a nice overview of the patterns
> > that I can see in the source data. However, due to the large number of
> > samples (1,000), and the apparent "spread" in the distribution of a
> > single vegetation type over a range of insolation values, I am having a
> > hard time quantitatively describing the relationship between the two
> > variables.
> > 
> > Here is a link to a sample graph:
> > http://casoilresource.lawr.ucdavis.edu/drupal/node/162
> > 
> > Since the data along each vegetation type "line" is not a distribution
> > in the traditional sense, I am having problems applying descriptive
> > statistical methods. Conceptually, I would like to somehow describe the
> > variation with insolation, along each vegetation type "line".
> > 
> > Any guidance, or suggested reading material would be greatly
> > appreciated.
> > 
> > 
> 
> 
> -- 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> David W. Roberts                                     office 406-994-4548
> Professor and Head                                      FAX 406-994-3190
> Department of Ecology                         email droberts at montana.edu
> Montana State University
> Bozeman, MT 59717-3460