[R] aggregate formula - differing results

Mon Sep 4 13:51:28 CEST 2023

Thanks Rui for your help; that would be one possibility indeed.

But am I the only one who finds that behavior of aggregate() completely 
unexpected and confusing? Especially considering that dplyr::summarise() 
and doBy::summaryBy() deal with NAs differently, even though they all 
use mean(na.rm = TRUE) to calculate the group stats.

Best wishes,
Ivan

On 04/09/2023 13:46, Rui Barradas wrote:
> Às 10:44 de 04/09/2023, Ivan Calandra escreveu:
>> Dear useRs,
>>
>> I have just stumbled across a behavior in aggregate() that I cannot 
>> explain. Any help would be appreciated!
>>
>> Sample data:
>> my_data <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100", 
>> "FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", 
>> "HORN-103", "HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 
>> 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = 
>> c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47, 1169.26, 444.61, 
>> 1791.48, 461.15, 1127.2), Length = c(44.384, 29.831, 43.869, 48.011, 
>> 54.109, 41.742, 23.854, 32.075, 21.337, 35.459), Width = c(45.982, 
>> 67.303, 52.679, 26.42, 25.149, 33.427, 20.683, 62.783, 26.417, 
>> 35.297), PLATWIDTH = c(38.84, NA, 15.33, 30.37, 11.44, 14.88, 13.86, 
>> NA, NA, 26.71), PLATTHICK = c(8.67, NA, 7.99, 11.69, 3.3, 16.52, 
>> 4.58, NA, NA, 9.35), EPA = c(78, NA, 78, 54, 72, 49, 56, NA, NA, 56), 
>> THICKNESS = c(10.97, NA, 9.36, 6.4, 5.89, 11.05, 4.9, NA, NA, 10.08), 
>> WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9, 29.5, 4.5, NA, NA, 23), RAWMAT 
>> = c("FLINT", "FLINT", "FLINT", "FLINT", "FLINT", "HORNFELS", 
>> "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS")), row.names = c(1L, 
>> 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), class = "data.frame")
>>
>> 1) Simple aggregation with 2 variables:
>> aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN = mean, 
>> na.rm = TRUE)
>>
>> 2) Using the dot notation - different results:
>> aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
>>
>> 3) Using dplyr, I get the same results as #1:
>> group_by(my_data, RAWMAT) %>%
>>    summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
>>
>> 4) It gets weirder: using all columns in #1 give the same results as 
>> in #2 but different from #1 and #3
>> aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH, 
>> PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN = 
>> mean, na.rm = TRUE)
>>
>> So it seems it is not only due to the notation (cbind() vs. dot). Is 
>> it a bug? A peculiar thing in my dataset? I tend to think this could 
>> be due to some variables (or their names) as all notations seem to 
>> agree when I remove some variables (although I haven't found out 
>> which variable(s) is (are) at fault), e.g.:
>>
>> my_data2 <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100", 
>> "FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", 
>> "HORN-103", "HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 
>> 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = 
>> c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47, 1169.26, 444.61, 
>> 1791.48, 461.15, 1127.2), Length = c(44.384, 29.831, 43.869, 48.011, 
>> 54.109, 41.742, 23.854, 32.075, 21.337, 35.459), Width = c(45.982, 
>> 67.303, 52.679, 26.42, 25.149, 33.427, 20.683, 62.783, 26.417, 
>> 35.297), RAWMAT = c("FLINT", "FLINT", "FLINT", "FLINT", "FLINT", 
>> "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS")), 
>> row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), 
>> class = "data.frame")
>>
>> aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT, 
>> data = my_data2, FUN = mean, na.rm = TRUE)
>>
>> aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)
>>
>> group_by(my_data2, RAWMAT) %>%
>>    summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
>>
>>
>> Thank you in advance for any hint.
>> Best wishes,
>> Ivan
>>
>>
>>
>>
>>      *LEIBNIZ-ZENTRUM*
>> *FÜR ARCHÄOLOGIE*
>>
>> *Dr. Ivan CALANDRA*
>> **Head of IMPALA (IMaging Platform At LeizA)
>>
>> *MONREPOS* Archaeological Research Centre, Schloss Monrepos
>> 56567 Neuwied, Germany
>>
>> T: +49 2631 9772 243
>> T: +49 6131 8885 543
>> ivan.calandra using leiza.de
>>
>> leiza.de <http://www.leiza.de/>
>> <http://www.leiza.de/>
>> ORCID <https://orcid.org/0000-0003-3816-6359>
>> ResearchGate
>> <https://www.researchgate.net/profile/Ivan_Calandra>
>>
>> LEIZA is a foundation under public law of the State of 
>> Rhineland-Palatinate and the City of Mainz. Its headquarters are in 
>> Mainz. Supervision is carried out by the Ministry of Science and 
>> Health of the State of Rhineland-Palatinate. LEIZA is a research 
>> museum of the Leibniz Association.
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> Hello,
>
> You can define a vector of the columns of interest and subset the data 
> with it. Then the default na.action = na.omit will no longer remove 
> the rows with NA vals in at least one column and the results are the 
> same.
>
> However, this will not give the mean values of the other numeric 
> columns, just of those two.
>
>
>
> # define a vector of columns of interest
> cols <- c("Length", "Width", "RAWMAT")
>
> # 1) Simple aggregation with 2 variables, select cols:
> aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data[cols], FUN = 
> mean, na.rm = TRUE)
>
> # 2) Using the dot notation - if cols are selected, equal results:
> aggregate(. ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm = TRUE)
>
> # 3) Using dplyr, the results are now the same results as #1 and #2:
> my_data %>%
>   select(all_of(cols)) %>%
>   group_by(RAWMAT) %>%
>   summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
>
>
> Hope this helps,
>
> Rui Barradas
>