[R] descriptive stats by cells in factorial design

Wed Aug 7 01:02:41 CEST 2013

I received two additional suggestions, one off-list, both appended below. 
Both helped me to learn a bit more about how to get what I want.

First, the aggregate() function is in package:stats, it provides the 
numbers I needed, but I don't like the output format as much as I liked 
the format from doBy:summaryBy().  Here it is:

> aggregate(Age ~ Generation + Zygosity + Sex + Cohort + ESstatus, data=x, function(x) c(mean=mean(x), sd=sd(x), quantile(x), N=length(x)))
    Generation Zygosity    Sex Cohort ESstatus    Age.mean      Age.sd      Age.0%     Age.25%     Age.50%     Age.75%    Age.100%       Age.N
1   Offspring       DZ Female     11       ES  17.7852830   0.3535863  16.9300000  17.6000000  17.7750000  17.9650000  18.9200000 106.0000000
2      Parent       DZ Female     11       ES  44.6151240   5.1246314  32.1700000  41.3400000  44.6800000  48.2800000  57.9500000 121.0000000
3   Offspring       MZ Female     11       ES  17.8762755   0.4506530  16.8600000  17.6775000  17.8050000  18.1000000  19.1200000 196.0000000
4      Parent       MZ Female     11       ES  44.2347573   5.0214627  29.5500000  40.6925000  44.1250000  47.7300000  56.7300000 206.0000000
5   Offspring       DZ   Male     11       ES  17.7614925   0.3467540  17.1800000  17.5150000  17.7150000  18.0000000  18.7100000 134.0000000
6      Parent       DZ   Male     11       ES  44.6020635   4.5605484  34.3100000  41.4475000  44.8900000  47.4975000  58.7500000 126.0000000
7   Offspring       MZ   Male     11       ES  17.7717436   0.3236917  16.8400000  17.5800000  17.7900000  17.9700000  19.0200000 195.0000000
8      Parent       MZ   Male     11       ES  43.4078680   5.3507439  31.2800000  39.9700000  43.4400000  46.4800000  64.6500000 197.0000000
9   Offspring       DZ Female     11    notES  18.1367901   0.5555968  16.7600000  17.8525000  18.1900000  18.4575000  19.5000000 162.0000000
10     Parent       DZ Female     11    notES  42.5434579   4.3670998  34.0300000  39.3450000  42.1100000  45.5500000  57.0600000 107.0000000
11  Offspring       MZ Female     11    notES  18.0573883   0.6103713  16.7600000  17.6300000  18.0500000  18.4200000  19.7000000 291.0000000
12     Parent       MZ Female     11    notES  42.3198837   5.3622671  30.3100000  38.6050000  41.8350000  46.0175000  56.5800000 172.0000000
13  Offspring       DZ   Male     11    notES  17.8766667   0.5187333  16.8300000  17.4600000  17.8600000  18.2400000  19.0200000 153.0000000
14     Parent       DZ   Male     11    notES  42.7112102   4.9600561  32.0500000  39.2400000  42.7600000  45.2700000  58.2000000 157.0000000
15  Offspring       MZ   Male     11    notES  17.8771831   0.6472397  16.5600000  17.3300000  17.8550000  18.2100000  20.0100000 284.0000000
16     Parent       MZ   Male     11    notES  41.5636254   4.6564818  32.1000000  38.0250000  41.3900000  44.6450000  65.2900000 331.0000000
17  Offspring       DZ Female     17    notES  17.4752880   0.4569588  16.5600000  17.0700000  17.5900000  17.8700000  18.2900000 191.0000000
18     Parent       DZ Female     17    notES  46.3055882   4.9177705  36.1000000  42.7275000  45.7650000  48.3350000  62.6900000  68.0000000
19  Offspring       MZ Female     17    notES  17.4106076   0.4956190  16.5500000  16.9700000  17.3400000  17.8200000  18.4500000 395.0000000
20     Parent       MZ Female     17    notES  46.3649032   5.1770435  34.8800000  42.4200000  45.9500000  49.4950000  63.1800000 155.0000000
21  Offspring       DZ   Male     17    notES  17.5041818   0.3915823  16.7300000  17.1900000  17.5300000  17.8300000  18.5200000 165.0000000
22     Parent       DZ   Male     17    notES  46.7745763   4.0226198  40.1800000  44.1250000  46.0000000  48.8200000  61.1200000  59.0000000
23  Offspring       MZ   Male     17    notES  17.4911446   0.3961757  16.6500000  17.1775000  17.5000000  17.8100000  18.3500000 332.0000000
24     Parent       MZ   Male     17    notES  46.6929771   5.2421896  34.4500000  43.1500000  45.8900000  49.0050000  63.8000000 131.0000000

That's great but there are two things I didn't like:  (1) There too many 
digits, especially on the integers in the last column.  I thought five 
digits to the right of the decimal was more than enough but here we have 
seven, even for integers.  (2) The ordering of levels within factors 
implied by the right side of the formula is not honored -- it looks like 
it used the order Cohort, ESstatus, Sex, Zygosity, Generation.  Unlike 
doBy::summaryBy(), it does not accept an order=T argument (that is the 
default in doBy::summaryBy()).

One thing both suggestions taught me was to use names in function 
definitions so that I always get correct column headings on output.  This 
was in the documentation for doBy::summaryBy(), but I didn't understand it 
when I first read it.  Using that naming concept, I created this function:

descriptivefun <- function(x, ...){c(mean=mean(x, ...), sd=sd(x, ...), quantile(x, ...), N=sum(!is.na(x)), NAs=sum(is.na(x)))}

That will allow me to feed the na.rm=T argument to the mean, sd and 
quantile functions.  By not naming the quantile function (e.g., not using 
q=quantile(x, ...)) I allow the builtin column names to be used unaltered 
(i.e., 0%, 25%, 50%, 75%, 100%).  I also did not use length() because it 
will count NA values and I want to see the sample sizes used for mean, sd 
and quantile.  To deal with that problem I created a function with output 
named "N" to count those sample sizes and one with output named "NAs" to 
count the number of NAs.  Then I do this:

> summaryBy(Age ~ Generation + Zygosity + Sex + Cohort + ESstatus, data=x, FUN=descriptivefun, na.rm=T)
    Generation Zygosity    Sex Cohort ESstatus Age.mean    Age.sd Age.0% Age.25% Age.50% Age.75% Age.100% Age.N Age.NAs
1   Offspring       DZ Female     11       ES 17.78528 0.3535863  16.93 17.6000  17.775 17.9650    18.92   106       0
2   Offspring       DZ Female     11    notES 18.13679 0.5555968  16.76 17.8525  18.190 18.4575    19.50   162       0
3   Offspring       DZ Female     17    notES 17.47529 0.4569588  16.56 17.0700  17.590 17.8700    18.29   191       0
4   Offspring       DZ   Male     11       ES 17.76149 0.3467540  17.18 17.5150  17.715 18.0000    18.71   134       0
5   Offspring       DZ   Male     11    notES 17.87667 0.5187333  16.83 17.4600  17.860 18.2400    19.02   153       0
6   Offspring       DZ   Male     17    notES 17.50418 0.3915823  16.73 17.1900  17.530 17.8300    18.52   165       0
7   Offspring       MZ Female     11       ES 17.87628 0.4506530  16.86 17.6775  17.805 18.1000    19.12   196       0
8   Offspring       MZ Female     11    notES 18.05739 0.6103713  16.76 17.6300  18.050 18.4200    19.70   291       0
9   Offspring       MZ Female     17    notES 17.41061 0.4956190  16.55 16.9700  17.340 17.8200    18.45   395       0
10  Offspring       MZ   Male     11       ES 17.77174 0.3236917  16.84 17.5800  17.790 17.9700    19.02   195       0
11  Offspring       MZ   Male     11    notES 17.87718 0.6472397  16.56 17.3300  17.855 18.2100    20.01   284       0
12  Offspring       MZ   Male     17    notES 17.49114 0.3961757  16.65 17.1775  17.500 17.8100    18.35   332       0
13     Parent       DZ Female     11       ES 44.61512 5.1246314  32.17 41.3400  44.680 48.2800    57.95   121       0
14     Parent       DZ Female     11    notES 42.54346 4.3670998  34.03 39.3450  42.110 45.5500    57.06   107       0
15     Parent       DZ Female     17    notES 46.30559 4.9177705  36.10 42.7275  45.765 48.3350    62.69    68       0
16     Parent       DZ   Male     11       ES 44.60206 4.5605484  34.31 41.4475  44.890 47.4975    58.75   126       0
17     Parent       DZ   Male     11    notES 42.71121 4.9600561  32.05 39.2400  42.760 45.2700    58.20   157       0
18     Parent       DZ   Male     17    notES 46.77458 4.0226198  40.18 44.1250  46.000 48.8200    61.12    59       0
19     Parent       MZ Female     11       ES 44.23476 5.0214627  29.55 40.6925  44.125 47.7300    56.73   206       0
20     Parent       MZ Female     11    notES 42.31988 5.3622671  30.31 38.6050  41.835 46.0175    56.58   172       0
21     Parent       MZ Female     17    notES 46.36490 5.1770435  34.88 42.4200  45.950 49.4950    63.18   155       0
22     Parent       MZ   Male     11       ES 43.40787 5.3507439  31.28 39.9700  43.440 46.4800    64.65   197       0
23     Parent       MZ   Male     11    notES 41.56363 4.6564818  32.10 38.0250  41.390 44.6450    65.29   331       0
24     Parent       MZ   Male     17    notES 46.69298 5.2421896  34.45 43.1500  45.890 49.0050    63.80   131       0

I think that output looks very nice.  One thing that I don't understand is 
why my function produces %.5f output for every value but the 
doBy::summaryBy() function uses different formats in different columns. 
Compare the above output with this output:

> descriptivefun(x$Age)
       mean         sd         0%        25%        50%        75%       100%          N        NAs
   28.49302   13.29077   16.55000   17.65000   18.23000   42.25500   65.29000 4434.00000    0.00000

It's not a big deal, but it would be cool if I could tell 
doBy::summaryBy() how to format the numbers using something like 
format=c(rep("%.2f",7), rep("%d",2)).

Mike

--
Michael B. Miller, Ph.D.
Minnesota Center for Twin and Family Research
Department of Psychology
University of Minnesota

On Mon, 5 Aug 2013, David Carlson wrote:

> This is a bit simpler. The function quantile() labels the output whereas 
> fivenum() does not:
>
> aggregate(Age ~ Generation + Zygosity + Sex + Cohort +
> ESstatus, data=x,
>    function(x) c(mean=mean(x), sd=sd(x), quantile(x)))

On Mon, 5 Aug 2013, Dr. Thomas W. MacFarland wrote:

> Dear Dr. Miller:
>
> Pasted below is syntax that should mostly answer your recent question to 
> the R mailing list:
>
> descriptivefun <- function(x, ...){
>  c(m=mean(x, ...), sd=sd(x, ...), l=length(x))
>  }
>
> doBy::summaryBy(Final ~ Method.recode +
>  ComCol.recode,
>  data=Final.table,
>  FUN=descriptivefun,
>  na.rm=TRUE,
>  keep.names=TRUE,
>  order=TRUE)
>
> I go into far more detail on this package::function and similar 
> functions in my recent text on Twoway ANOVA,
> http://www.springer.com/statistics/social+sciences+%26+law/book/978-1-4614-2133-7.
>
> Best wishes.
>
> Tom