[R] descriptive stats by cells in factorial design
Mike Miller
mbmiller+l at gmail.com
Wed Aug 7 01:02:41 CEST 2013
I received two additional suggestions, one off-list, both appended below.
Both helped me to learn a bit more about how to get what I want.
First, the aggregate() function is in package:stats, it provides the
numbers I needed, but I don't like the output format as much as I liked
the format from doBy:summaryBy(). Here it is:
> aggregate(Age ~ Generation + Zygosity + Sex + Cohort + ESstatus, data=x, function(x) c(mean=mean(x), sd=sd(x), quantile(x), N=length(x)))
Generation Zygosity Sex Cohort ESstatus Age.mean Age.sd Age.0% Age.25% Age.50% Age.75% Age.100% Age.N
1 Offspring DZ Female 11 ES 17.7852830 0.3535863 16.9300000 17.6000000 17.7750000 17.9650000 18.9200000 106.0000000
2 Parent DZ Female 11 ES 44.6151240 5.1246314 32.1700000 41.3400000 44.6800000 48.2800000 57.9500000 121.0000000
3 Offspring MZ Female 11 ES 17.8762755 0.4506530 16.8600000 17.6775000 17.8050000 18.1000000 19.1200000 196.0000000
4 Parent MZ Female 11 ES 44.2347573 5.0214627 29.5500000 40.6925000 44.1250000 47.7300000 56.7300000 206.0000000
5 Offspring DZ Male 11 ES 17.7614925 0.3467540 17.1800000 17.5150000 17.7150000 18.0000000 18.7100000 134.0000000
6 Parent DZ Male 11 ES 44.6020635 4.5605484 34.3100000 41.4475000 44.8900000 47.4975000 58.7500000 126.0000000
7 Offspring MZ Male 11 ES 17.7717436 0.3236917 16.8400000 17.5800000 17.7900000 17.9700000 19.0200000 195.0000000
8 Parent MZ Male 11 ES 43.4078680 5.3507439 31.2800000 39.9700000 43.4400000 46.4800000 64.6500000 197.0000000
9 Offspring DZ Female 11 notES 18.1367901 0.5555968 16.7600000 17.8525000 18.1900000 18.4575000 19.5000000 162.0000000
10 Parent DZ Female 11 notES 42.5434579 4.3670998 34.0300000 39.3450000 42.1100000 45.5500000 57.0600000 107.0000000
11 Offspring MZ Female 11 notES 18.0573883 0.6103713 16.7600000 17.6300000 18.0500000 18.4200000 19.7000000 291.0000000
12 Parent MZ Female 11 notES 42.3198837 5.3622671 30.3100000 38.6050000 41.8350000 46.0175000 56.5800000 172.0000000
13 Offspring DZ Male 11 notES 17.8766667 0.5187333 16.8300000 17.4600000 17.8600000 18.2400000 19.0200000 153.0000000
14 Parent DZ Male 11 notES 42.7112102 4.9600561 32.0500000 39.2400000 42.7600000 45.2700000 58.2000000 157.0000000
15 Offspring MZ Male 11 notES 17.8771831 0.6472397 16.5600000 17.3300000 17.8550000 18.2100000 20.0100000 284.0000000
16 Parent MZ Male 11 notES 41.5636254 4.6564818 32.1000000 38.0250000 41.3900000 44.6450000 65.2900000 331.0000000
17 Offspring DZ Female 17 notES 17.4752880 0.4569588 16.5600000 17.0700000 17.5900000 17.8700000 18.2900000 191.0000000
18 Parent DZ Female 17 notES 46.3055882 4.9177705 36.1000000 42.7275000 45.7650000 48.3350000 62.6900000 68.0000000
19 Offspring MZ Female 17 notES 17.4106076 0.4956190 16.5500000 16.9700000 17.3400000 17.8200000 18.4500000 395.0000000
20 Parent MZ Female 17 notES 46.3649032 5.1770435 34.8800000 42.4200000 45.9500000 49.4950000 63.1800000 155.0000000
21 Offspring DZ Male 17 notES 17.5041818 0.3915823 16.7300000 17.1900000 17.5300000 17.8300000 18.5200000 165.0000000
22 Parent DZ Male 17 notES 46.7745763 4.0226198 40.1800000 44.1250000 46.0000000 48.8200000 61.1200000 59.0000000
23 Offspring MZ Male 17 notES 17.4911446 0.3961757 16.6500000 17.1775000 17.5000000 17.8100000 18.3500000 332.0000000
24 Parent MZ Male 17 notES 46.6929771 5.2421896 34.4500000 43.1500000 45.8900000 49.0050000 63.8000000 131.0000000
That's great but there are two things I didn't like: (1) There too many
digits, especially on the integers in the last column. I thought five
digits to the right of the decimal was more than enough but here we have
seven, even for integers. (2) The ordering of levels within factors
implied by the right side of the formula is not honored -- it looks like
it used the order Cohort, ESstatus, Sex, Zygosity, Generation. Unlike
doBy::summaryBy(), it does not accept an order=T argument (that is the
default in doBy::summaryBy()).
One thing both suggestions taught me was to use names in function
definitions so that I always get correct column headings on output. This
was in the documentation for doBy::summaryBy(), but I didn't understand it
when I first read it. Using that naming concept, I created this function:
descriptivefun <- function(x, ...){c(mean=mean(x, ...), sd=sd(x, ...), quantile(x, ...), N=sum(!is.na(x)), NAs=sum(is.na(x)))}
That will allow me to feed the na.rm=T argument to the mean, sd and
quantile functions. By not naming the quantile function (e.g., not using
q=quantile(x, ...)) I allow the builtin column names to be used unaltered
(i.e., 0%, 25%, 50%, 75%, 100%). I also did not use length() because it
will count NA values and I want to see the sample sizes used for mean, sd
and quantile. To deal with that problem I created a function with output
named "N" to count those sample sizes and one with output named "NAs" to
count the number of NAs. Then I do this:
> summaryBy(Age ~ Generation + Zygosity + Sex + Cohort + ESstatus, data=x, FUN=descriptivefun, na.rm=T)
Generation Zygosity Sex Cohort ESstatus Age.mean Age.sd Age.0% Age.25% Age.50% Age.75% Age.100% Age.N Age.NAs
1 Offspring DZ Female 11 ES 17.78528 0.3535863 16.93 17.6000 17.775 17.9650 18.92 106 0
2 Offspring DZ Female 11 notES 18.13679 0.5555968 16.76 17.8525 18.190 18.4575 19.50 162 0
3 Offspring DZ Female 17 notES 17.47529 0.4569588 16.56 17.0700 17.590 17.8700 18.29 191 0
4 Offspring DZ Male 11 ES 17.76149 0.3467540 17.18 17.5150 17.715 18.0000 18.71 134 0
5 Offspring DZ Male 11 notES 17.87667 0.5187333 16.83 17.4600 17.860 18.2400 19.02 153 0
6 Offspring DZ Male 17 notES 17.50418 0.3915823 16.73 17.1900 17.530 17.8300 18.52 165 0
7 Offspring MZ Female 11 ES 17.87628 0.4506530 16.86 17.6775 17.805 18.1000 19.12 196 0
8 Offspring MZ Female 11 notES 18.05739 0.6103713 16.76 17.6300 18.050 18.4200 19.70 291 0
9 Offspring MZ Female 17 notES 17.41061 0.4956190 16.55 16.9700 17.340 17.8200 18.45 395 0
10 Offspring MZ Male 11 ES 17.77174 0.3236917 16.84 17.5800 17.790 17.9700 19.02 195 0
11 Offspring MZ Male 11 notES 17.87718 0.6472397 16.56 17.3300 17.855 18.2100 20.01 284 0
12 Offspring MZ Male 17 notES 17.49114 0.3961757 16.65 17.1775 17.500 17.8100 18.35 332 0
13 Parent DZ Female 11 ES 44.61512 5.1246314 32.17 41.3400 44.680 48.2800 57.95 121 0
14 Parent DZ Female 11 notES 42.54346 4.3670998 34.03 39.3450 42.110 45.5500 57.06 107 0
15 Parent DZ Female 17 notES 46.30559 4.9177705 36.10 42.7275 45.765 48.3350 62.69 68 0
16 Parent DZ Male 11 ES 44.60206 4.5605484 34.31 41.4475 44.890 47.4975 58.75 126 0
17 Parent DZ Male 11 notES 42.71121 4.9600561 32.05 39.2400 42.760 45.2700 58.20 157 0
18 Parent DZ Male 17 notES 46.77458 4.0226198 40.18 44.1250 46.000 48.8200 61.12 59 0
19 Parent MZ Female 11 ES 44.23476 5.0214627 29.55 40.6925 44.125 47.7300 56.73 206 0
20 Parent MZ Female 11 notES 42.31988 5.3622671 30.31 38.6050 41.835 46.0175 56.58 172 0
21 Parent MZ Female 17 notES 46.36490 5.1770435 34.88 42.4200 45.950 49.4950 63.18 155 0
22 Parent MZ Male 11 ES 43.40787 5.3507439 31.28 39.9700 43.440 46.4800 64.65 197 0
23 Parent MZ Male 11 notES 41.56363 4.6564818 32.10 38.0250 41.390 44.6450 65.29 331 0
24 Parent MZ Male 17 notES 46.69298 5.2421896 34.45 43.1500 45.890 49.0050 63.80 131 0
I think that output looks very nice. One thing that I don't understand is
why my function produces %.5f output for every value but the
doBy::summaryBy() function uses different formats in different columns.
Compare the above output with this output:
> descriptivefun(x$Age)
mean sd 0% 25% 50% 75% 100% N NAs
28.49302 13.29077 16.55000 17.65000 18.23000 42.25500 65.29000 4434.00000 0.00000
It's not a big deal, but it would be cool if I could tell
doBy::summaryBy() how to format the numbers using something like
format=c(rep("%.2f",7), rep("%d",2)).
Mike
--
Michael B. Miller, Ph.D.
Minnesota Center for Twin and Family Research
Department of Psychology
University of Minnesota
On Mon, 5 Aug 2013, David Carlson wrote:
> This is a bit simpler. The function quantile() labels the output whereas
> fivenum() does not:
>
> aggregate(Age ~ Generation + Zygosity + Sex + Cohort +
> ESstatus, data=x,
> function(x) c(mean=mean(x), sd=sd(x), quantile(x)))
On Mon, 5 Aug 2013, Dr. Thomas W. MacFarland wrote:
> Dear Dr. Miller:
>
> Pasted below is syntax that should mostly answer your recent question to
> the R mailing list:
>
> descriptivefun <- function(x, ...){
> c(m=mean(x, ...), sd=sd(x, ...), l=length(x))
> }
>
> doBy::summaryBy(Final ~ Method.recode +
> ComCol.recode,
> data=Final.table,
> FUN=descriptivefun,
> na.rm=TRUE,
> keep.names=TRUE,
> order=TRUE)
>
> I go into far more detail on this package::function and similar
> functions in my recent text on Twoway ANOVA,
> http://www.springer.com/statistics/social+sciences+%26+law/book/978-1-4614-2133-7.
>
> Best wishes.
>
> Tom
More information about the R-help
mailing list