[R] descriptive stats by cells in factorial design
David Winsemius
dwinsemius at comcast.net
Wed Aug 7 02:20:17 CEST 2013
On Aug 6, 2013, at 4:02 PM, Mike Miller wrote:
> I received two additional suggestions, one off-list, both appended below. Both helped me to learn a bit more about how to get what I want.
>
> First, the aggregate() function in package:stats provides the numbers I needed, but I don't like the output format as much as I liked the format from doBy::summaryBy(). Here it is:
>
>> aggregate(Age ~ Generation + Zygosity + Sex + Cohort + ESstatus, data=x, function(x) c(mean=mean(x), sd=sd(x), quantile(x), N=length(x)))
> Generation Zygosity Sex Cohort ESstatus Age.mean Age.sd Age.0% Age.25% Age.50% Age.75% Age.100% Age.N
> 1 Offspring DZ Female 11 ES 17.7852830 0.3535863 16.9300000 17.6000000 17.7750000 17.9650000 18.9200000 106.0000000
> 2 Parent DZ Female 11 ES 44.6151240 5.1246314 32.1700000 41.3400000 44.6800000 48.2800000 57.9500000 121.0000000
>
snipped
> 23 Offspring MZ Male 17 notES 17.4911446 0.3961757 16.6500000 17.1775000 17.5000000 17.8100000 18.3500000 332.0000000
> 24 Parent MZ Male 17 notES 46.6929771 5.2421896 34.4500000 43.1500000 45.8900000 49.0050000 63.8000000 131.0000000
>
> That's great but there are two things I didn't like: (1) There are too many digits, especially on the integers in the last column. I thought five digits to the right of the decimal was more than enough but here we have seven, even for integers. (2) The ordering of levels within factors implied by the right side of the formula is not honored -- it looks like it used the order Cohort, ESstatus, Sex, Zygosity, Generation. Unlike doBy::summaryBy(), it does not accept an order=T argument (that is the default in doBy::summaryBy()).
>
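A possible workaround for both complaints, sketched on made-up data (the data frame `d` and its values are assumptions, not the poster's data). When FUN returns several values, aggregate() stores them as a single matrix column, and as far as I can tell that is why every statistic, including N, is printed with the same number of decimals. Flattening the matrix into ordinary columns and sorting with order() may get closer to the summaryBy() layout:

```r
# Toy data standing in for the poster's data frame (all values are made up).
d <- data.frame(
  Sex    = factor(rep(c("F", "M"), each = 4)),
  Cohort = factor(rep(c(11, 17), times = 4)),
  Age    = c(17.5, 18.25, 17.25, 18.75, 17.75, 18.5, 17.5, 18.0)
)

# aggregate() keeps the multi-valued summary as ONE matrix column named Age.
agg <- aggregate(Age ~ Sex + Cohort, data = d,
                 FUN = function(v) c(mean = mean(v), sd = sd(v), N = length(v)))
is.matrix(agg$Age)

# Flatten the matrix into ordinary columns (Age.mean, Age.sd, Age.N)
# so that each column is formatted on its own when printed.
flat <- do.call(data.frame, agg)

# Sort rows by the grouping variables in the order used in the formula.
flat <- flat[order(flat$Sex, flat$Cohort), ]
print(flat, digits = 5, row.names = FALSE)
```

The digits argument of print() controls significant digits, not decimal places, so it only approximates a fixed number of decimals.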
> One thing both suggestions taught me was to use names in function definitions so that I always get correct column headings on output. This was in the documentation for doBy::summaryBy(), but I didn't understand it when I first read it. Using that naming concept, I created this function:
>
> descriptivefun <- function(x, ...){c(mean=mean(x, ...), sd=sd(x, ...), quantile(x, ...), N=sum(!is.na(x)), NAs=sum(is.na(x)))}
>
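A small check of what the function quoted above returns when the input contains a missing value. The vector `ages` is made up for illustration; the definition is repeated only so the snippet runs on its own:

```r
# Same definition as quoted above, repeated so the example is self-contained.
descriptivefun <- function(x, ...) {
  c(mean = mean(x, ...), sd = sd(x, ...), quantile(x, ...),
    N = sum(!is.na(x)), NAs = sum(is.na(x)))
}

# Made-up ages with one missing value.
ages <- c(16.5, 17.0, 17.5, 18.0, NA)

# na.rm = TRUE is forwarded through ... to mean(), sd() and quantile().
res <- descriptivefun(ages, na.rm = TRUE)
print(res)

# N counts the values actually used; length(ages) would have reported 5.
res[["N"]]
res[["NAs"]]
```

Writing na.rm = TRUE rather than na.rm = T is the safer habit, since T is an ordinary variable that can be reassigned.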
> That will allow me to feed the na.rm=T argument to the mean, sd and quantile functions. By not naming the quantile function (e.g., not using q=quantile(x, ...)) I allow the builtin column names to be used unaltered (i.e., 0%, 25%, 50%, 75%, 100%). I also did not use length() because it will count NA values and I want to see the sample sizes used for mean, sd and quantile. To deal with that problem I created a function with output named "N" to count those sample sizes and one with output named "NAs" to count the number of NAs. Then I do this:
>
>> summaryBy(Age ~ Generation + Zygosity + Sex + Cohort + ESstatus, data=x, FUN=descriptivefun, na.rm=T)
> Generation Zygosity Sex Cohort ESstatus Age.mean Age.sd Age.0% Age.25% Age.50% Age.75% Age.100% Age.N Age.NAs
> 1 Offspring DZ Female 11 ES 17.78528 0.3535863 16.93 17.6000 17.775 17.9650 18.92 106 0
> 2 Offspring DZ Female 11 notES 18.13679 0.5555968 16.76 17.8525 18.190 18.4575 19.50 162 0
>
snipped
> 22 Parent MZ Male 11 ES 43.40787 5.3507439 31.28 39.9700 43.440 46.4800 64.65 197 0
> 23 Parent MZ Male 11 notES 41.56363 4.6564818 32.10 38.0250 41.390 44.6450 65.29 331 0
> 24 Parent MZ Male 17 notES 46.69298 5.2421896 34.45 43.1500 45.890 49.0050 63.80 131 0
>
> I think that output looks very nice. One thing that I don't understand is why my function produces %.5f output for every value but the doBy::summaryBy() function uses different formats in different columns.
Look at the code. You are attributing to `summaryBy` behavior that belongs to `print.data.frame` and `format.data.frame`. `summaryBy` returns a data frame, and a data frame is formatted column by column. Your function returns a numeric vector, which is displayed by `print.default`, where all elements share one format.
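A minimal sketch of the difference, using made-up numbers rather than the real data (the objects `v` and `df` are illustrative only):

```r
# Made-up values; only the printing behaviour matters here.
v <- c(mean = 28.49302, sd = 13.29077, N = 4434)

# A named numeric vector goes through print.default(): all elements share
# one format, so the integer-valued N picks up trailing zeros.
print(v)
vec_txt <- format(v)

# A data frame goes through print.data.frame(), which relies on
# format.data.frame(); each column is formatted on its own.
df <- data.frame(mean = 28.49302, sd = 13.29077, N = 4434)
print(df)
df_txt <- format(df)
```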
--
David.
> Compare the above output with this output:
>
>> descriptivefun(x$Age)
> mean sd 0% 25% 50% 75% 100% N NAs
> 28.49302 13.29077 16.55000 17.65000 18.23000 42.25500 65.29000 4434.00000 0.00000
>
> It's not a big deal, but it would be cool if I could tell doBy::summaryBy() how to format the numbers using something like format=c(rep("%.2f",7), rep("%d",2)).
>
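I am not aware of such a format argument in summaryBy(), but since the result is an ordinary data frame, the columns can be formatted afterwards. A rough sketch, assuming a hypothetical helper `format_columns()` (not part of doBy) and a made-up summary table `s`:

```r
# Hypothetical helper, not part of doBy: apply one sprintf() format per column.
format_columns <- function(df, cols, fmts) {
  stopifnot(length(cols) == length(fmts))
  out <- df
  for (i in seq_along(cols)) {
    out[[cols[i]]] <- sprintf(fmts[i], out[[cols[i]]])
  }
  out
}

# Made-up summary table shaped like the summaryBy() output above.
s <- data.frame(Sex      = c("Female", "Male"),
                Age.mean = c(17.78528, 46.69298),
                Age.sd   = c(0.3535863, 5.2421896),
                Age.N    = c(106, 131))

# "%.0f" is used for the counts because it works for both integer and
# double storage.
pretty <- format_columns(s,
                         cols = c("Age.mean", "Age.sd", "Age.N"),
                         fmts = c("%.2f", "%.2f", "%.0f"))
print(pretty, row.names = FALSE)
```

The formatted columns become character strings, so this is best done as a last step before display and not before further computation.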
> Mike
>
> --
> Michael B. Miller, Ph.D.
> Minnesota Center for Twin and Family Research
> Department of Psychology
> University of Minnesota
>
>
>
> On Mon, 5 Aug 2013, David Carlson wrote:
>
>> This is a bit simpler. The function quantile() labels the output whereas fivenum() does not:
>>
>> aggregate(Age ~ Generation + Zygosity + Sex + Cohort +
>> ESstatus, data=x,
>> function(x) c(mean=mean(x), sd=sd(x), quantile(x)))
>
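The labelling difference mentioned above can be checked directly. A short illustration on a made-up vector `x`:

```r
# Made-up data for illustration.
x <- c(2, 4, 4, 5, 7, 9, 10)

# quantile() returns a named vector; fivenum() returns an unnamed one.
q_names <- names(quantile(x))
f_names <- names(fivenum(x))
print(q_names)
print(f_names)

# Inside c(), the names supplied by quantile() are kept as they are,
# which is what gives aggregate() its column labels.
summ <- c(mean = mean(x), sd = sd(x), quantile(x))
print(names(summ))
```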
>
> On Mon, 5 Aug 2013, Dr. Thomas W. MacFarland wrote:
>
>> Dear Dr. Miller:
>>
>> Pasted below is syntax that should mostly answer your recent question to the R mailing list:
>>
>> descriptivefun <- function(x, ...){
>> c(m=mean(x, ...), sd=sd(x, ...), l=length(x))
>> }
>>
>> doBy::summaryBy(Final ~ Method.recode +
>> ComCol.recode,
>> data=Final.table,
>> FUN=descriptivefun,
>> na.rm=TRUE,
>> keep.names=TRUE,
>> order=TRUE)
>>
>> I go into far more detail on this package::function and similar functions in my recent text on Twoway ANOVA,
>> http://www.springer.com/statistics/social+sciences+%26+law/book/978-1-4614-2133-7.
>>
>> Best wishes.
>>
>> Tom
David Winsemius
Alameda, CA, USA