[R] aggregate.data.frame(drop=FALSE) in R 3.3.0
Suharto Anggono
suharto_anggono at yahoo.com
Sat May 14 08:38:54 CEST 2016
From NEWS: The data frame and formula methods for aggregate() gain a drop argument.
Here, I highlight the behavior of 'aggregate.data.frame' with drop=FALSE in R 3.3.0.
Example 1, modified from "example with character variables and NAs" in "Examples" in R help on 'aggregate':
> testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
+ v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )
> by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)
> by2 <- c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)
> str(aggregate(x = testDF, by = list(by1, by2), FUN = "mean", drop = FALSE))
'data.frame': 30 obs. of 4 variables:
$ Group.1: Factor w/ 5 levels "1","2","big",..: 1 2 3 4 5 1 2 3 4 5 ...
$ Group.2: Factor w/ 6 levels "95","99","damp",..: 1 1 1 1 1 2 2 2 2 2 ...
$ v1 : num 5 7 NaN NaN NaN 5 NA NaN NaN NaN ...
$ v2 : num 55 77 NaN NaN NaN 55 NA NaN NaN NaN ...
- attr(*, "out.attrs")=List of 2
..$ dim : Named int 5 6
.. ..- attr(*, "names")= chr "Group.1" "Group.2"
..$ dimnames:List of 2
.. ..$ Group.1: chr "Group.1=1" "Group.1=2" "Group.1=big" "Group.1=blue" ...
.. ..$ Group.2: chr "Group.2=95" "Group.2=99" "Group.2=damp" "Group.2=dry" ...
> str(aggregate(x = testDF, by = list(by1, by2), FUN = "mean"))
'data.frame': 8 obs. of 4 variables:
$ Group.1: chr "1" "2" "1" "2" ...
$ Group.2: chr "95" "95" "99" "99" ...
$ v1 : num 5 7 5 NA 3 3 4 1
$ v2 : num 55 77 55 NA 33 33 44 11
The result of 'aggregate.data.frame' with drop=FALSE has an attribute "out.attrs"; the result of the default 'aggregate.data.frame' (drop=TRUE) does not.
A character grouping variable becomes a factor in the result of 'aggregate.data.frame' with drop=FALSE; it stays character in the result of the default 'aggregate.data.frame' (drop=TRUE).
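To make the two differences concrete, here is a minimal sketch that inspects the result and, if needed, brings it back to the shape of the drop=TRUE result. The helper name 'normalize_groups' is hypothetical (it is not part of R), and what the inspection reports may differ in R versions other than 3.3.0.

```r
testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
                     v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99))
by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)
by2 <- c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)
res <- aggregate(x = testDF, by = list(by1, by2), FUN = "mean", drop = FALSE)

# Inspect what this R version returned (no particular outcome is assumed).
has_out_attrs <- !is.null(attr(res, "out.attrs"))
group_classes <- sapply(res[c("Group.1", "Group.2")], class)

# Hypothetical helper: remove "out.attrs" and turn factor grouping
# columns back into character columns; other columns are left alone.
normalize_groups <- function(d, group_cols) {
  attr(d, "out.attrs") <- NULL
  d[group_cols] <- lapply(d[group_cols], function(col) {
    if (is.factor(col)) as.character(col) else col
  })
  d
}

res_norm <- normalize_groups(res, c("Group.1", "Group.2"))
```

After this, 'res_norm' has character grouping columns and no "out.attrs" attribute, whichever form 'aggregate' returned.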
Example 2, modified from "Compute the averages according to region and the occurrence of more than 130 days of frost" in "Examples" in R help on 'aggregate':
> aggregate(state.x77,
+ list(Region = state.region,
+ Cold = state.x77[,"Frost"] > 130),
+ mean, drop = FALSE)
Region Cold Population Income Illiteracy Life Exp Murder
1 Northeast FALSE 8802.8000 4780.400 1.1800000 71.12800 5.580000
2 South FALSE 4208.1250 4011.938 1.7375000 69.70625 10.581250
3 North Central FALSE 7233.8333 4633.333 0.7833333 70.95667 8.283333
4 West FALSE 4582.5714 4550.143 1.2571429 71.70000 6.828571
5 Northeast TRUE 1360.5000 4307.500 0.7750000 71.43500 3.650000
6 South TRUE NaN NaN NaN NaN NaN
7 North Central TRUE 2372.1667 4588.833 0.6166667 72.57667 2.266667
8 West TRUE 970.1667 4880.500 0.7500000 70.69167 7.666667
HS Grad Frost Area
1 52.06000 110.6000 21838.60
2 44.34375 64.6250 54605.12
3 53.36667 120.0000 56736.50
4 60.11429 51.0000 91863.71
5 56.35000 160.5000 13519.00
6 NaN NaN NaN
7 55.66667 157.6667 68567.50
8 64.20000 161.8333 184162.17
> aggregate(state.x77,
+ list(Region = state.region,
+ Cold = state.x77[,"Frost"] > 130),
+ mean)
Region Cold Population Income Illiteracy Life Exp Murder
1 Northeast FALSE 8802.8000 4780.400 1.1800000 71.12800 5.580000
2 South FALSE 4208.1250 4011.938 1.7375000 69.70625 10.581250
3 North Central FALSE 7233.8333 4633.333 0.7833333 70.95667 8.283333
4 West FALSE 4582.5714 4550.143 1.2571429 71.70000 6.828571
5 Northeast TRUE 1360.5000 4307.500 0.7750000 71.43500 3.650000
6 North Central TRUE 2372.1667 4588.833 0.6166667 72.57667 2.266667
7 West TRUE 970.1667 4880.500 0.7500000 70.69167 7.666667
HS Grad Frost Area
1 52.06000 110.6000 21838.60
2 44.34375 64.6250 54605.12
3 53.36667 120.0000 56736.50
4 60.11429 51.0000 91863.71
5 56.35000 160.5000 13519.00
6 55.66667 157.6667 68567.50
7 64.20000 161.8333 184162.17
Unlike 'tapply', 'aggregate.data.frame' with drop=FALSE also applies the function (mean in example 2 above) to the empty subset that corresponds to a combination of grouping variables that does not appear in the data. That is why row 6 (South, TRUE) in the first output contains NaN.
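For comparison, a short sketch of the 'tapply' side of this statement, using only the Population column of the same data. The variable names 'tp' and 'empty_mean' are mine, chosen for illustration.

```r
# mean() of an empty numeric vector is NaN; this is what a function
# applied to an empty subset produces.
empty_mean <- mean(numeric(0))

# tapply() does not call the function for the empty cell (South, TRUE);
# that cell is left as NA instead.
tp <- tapply(state.x77[, "Population"],
             list(Region = state.region,
                  Cold = state.x77[, "Frost"] > 130),
             mean)
south_cold <- tp["South", "TRUE"]
```

Here 'south_cold' is NA rather than NaN, while 'empty_mean' is NaN.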
Example 3, modified from http://stackoverflow.com/questions/22523131/dplyr-summarise-equivalent-of-drop-false-to-keep-groups-with-zero-length-in :
> DF <- data.frame(a=rep(1:3,4), b=factor(rep(1:2,6), levels=1:3))
> aggregate(DF["a"], DF["b"], length, drop=FALSE)
b a
1 1 6
2 2 6
Unlike 'interaction' with drop=FALSE, or 'tapply', 'aggregate.data.frame' with drop=FALSE omits levels of a factor grouping variable that never appear in the data (in example 3 above, "3" in 'b').
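A sketch of the contrast with 'tapply', followed by one possible way to get a row for every level regardless of what 'aggregate' returns. The merge-based workaround and the names 'counts', 'agg' and 'full' are my own assumptions, not something 'aggregate' provides.

```r
DF <- data.frame(a = rep(1:3, 4), b = factor(rep(1:2, 6), levels = 1:3))

# tapply() returns one cell per level of 'b', including the unused "3".
counts <- tapply(DF$a, DF$b, length)

# The number of rows of 'agg' depends on the R version in use.
agg <- aggregate(DF["a"], DF["b"], length, drop = FALSE)

# Possible workaround: left-join the aggregate onto the full set of levels,
# so that a level missing from 'agg' shows up with NA.
full <- merge(data.frame(b = levels(DF$b)), agg, by = "b", all.x = TRUE)
```

'full' has one row per level of 'b', with NA in 'a' for the level that never occurs.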