[R] aggregate formula - differing results

Mon Sep 4 12:52:39 CEST 2023

Thanks Iago for the pointer.

It then means that na.rm = TRUE is not applied in the same way within 
aggregate() as opposed to dplyr::group_by() + summarise(), right? Within 
aggregate, it behaves like na.omit(), that is, it excludes the 
incomplete cases (whole rows), whereas with group_by() + summarise() it 
is applied on each vector (variable), which is what I actually would expect.

I hadn't showed it, but doBy::summaryBy() produces the same results as 
group_by() + summarise().

Ivan

On 04/09/2023 12:45, Iago Giné Vázquez wrote:
> It seems that the issue are the missings. If in  #1 you use the 
> dataset na.omit(my_data) instead of my_data, you get the same output 
> that in #2 and in #4, where all observations with missing data are 
> removed since you are including all the variables.
>
>
> The second dataset has no issue since it has no missing data.
>
> Iago
> ------------------------------------------------------------------------
> *De:* R-help <r-help-bounces using r-project.org> de part de Ivan Calandra 
> <ivan.calandra using leiza.de>
> *Enviat el:* dilluns, 4 de setembre de 2023 11:44
> *Per a:* R-help <r-help using r-project.org>
> *Tema:* [R] aggregate formula - differing results
> Dear useRs,
>
> I have just stumbled across a behavior in aggregate() that I cannot
> explain. Any help would be appreciated!
>
> Sample data:
> my_data <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
> "FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", "HORN-103",
> "HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41,
> 121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
> 1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
> Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854,
> 32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42,
> 25.149, 33.427, 20.683, 62.783, 26.417, 35.297), PLATWIDTH = c(38.84,
> NA, 15.33, 30.37, 11.44, 14.88, 13.86, NA, NA, 26.71), PLATTHICK =
> c(8.67, NA, 7.99, 11.69, 3.3, 16.52, 4.58, NA, NA, 9.35), EPA = c(78,
> NA, 78, 54, 72, 49, 56, NA, NA, 56), THICKNESS = c(10.97, NA, 9.36, 6.4,
> 5.89, 11.05, 4.9, NA, NA, 10.08), WEIGHT = c(34.3, NA, 25.5, 18.6, 14.9,
> 29.5, 4.5, NA, NA, 23), RAWMAT = c("FLINT", "FLINT", "FLINT", "FLINT",
> "FLINT", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS")),
> row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L, 114L, 115L), class =
> "data.frame")
>
> 1) Simple aggregation with 2 variables:
> aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN = mean,
> na.rm = TRUE)
>
> 2) Using the dot notation - different results:
> aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
>
> 3) Using dplyr, I get the same results as #1:
> group_by(my_data, RAWMAT) %>%
>    summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
>
> 4) It gets weirder: using all columns in #1 give the same results as in
> #2 but different from #1 and #3
> aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH,
> PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN = mean,
> na.rm = TRUE)
>
> So it seems it is not only due to the notation (cbind() vs. dot). Is it
> a bug? A peculiar thing in my dataset? I tend to think this could be due
> to some variables (or their names) as all notations seem to agree when I
> remove some variables (although I haven't found out which variable(s) is
> (are) at fault), e.g.:
>
> my_data2 <- structure(list(ID = c("FLINT-1", "FLINT-10", "FLINT-100",
> "FLINT-101", "FLINT-102", "HORN-10", "HORN-100", "HORN-102", "HORN-103",
> "HORN-104"), EdgeLength = c(130.75, 168.77, 142.79, 130.1, 140.41,
> 121.37, 70.52, 122.3, 71.01, 104.5), SurfaceArea = c(1736.87, 1571.83,
> 1656.46, 1247.18, 1177.47, 1169.26, 444.61, 1791.48, 461.15, 1127.2),
> Length = c(44.384, 29.831, 43.869, 48.011, 54.109, 41.742, 23.854,
> 32.075, 21.337, 35.459), Width = c(45.982, 67.303, 52.679, 26.42,
> 25.149, 33.427, 20.683, 62.783, 26.417, 35.297), RAWMAT = c("FLINT",
> "FLINT", "FLINT", "FLINT", "FLINT", "HORNFELS", "HORNFELS", "HORNFELS",
> "HORNFELS", "HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L,
> 113L, 114L, 115L), class = "data.frame")
>
> aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT, data =
> my_data2, FUN = mean, na.rm = TRUE)
>
> aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)
>
> group_by(my_data2, RAWMAT) %>%
>    summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
>
>
> Thank you in advance for any hint.
> Best wishes,
> Ivan
>
>
>
>
>         *LEIBNIZ-ZENTRUM*
> *FÜR ARCHÄOLOGIE*
>
> *Dr. Ivan CALANDRA*
> **Head of IMPALA (IMaging Platform At LeizA)
>
> *MONREPOS* Archaeological Research Centre, Schloss Monrepos
> 56567 Neuwied, Germany
>
> T: +49 2631 9772 243
> T: +49 6131 8885 543
> ivan.calandra using leiza.de
>
> leiza.de <http://www.leiza.de/>
> <http://www.leiza.de/>
> ORCID <https://orcid.org/0000-0003-3816-6359>
> ResearchGate
> <https://www.researchgate.net/profile/Ivan_Calandra>
>
> LEIZA is a foundation under public law of the State of
> Rhineland-Palatinate and the City of Mainz. Its headquarters are in
> Mainz. Supervision is carried out by the Ministry of Science and Health
> of the State of Rhineland-Palatinate. LEIZA is a research museum of the
> Leibniz Association.
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html 
> <http://www.R-project.org/posting-guide.html>
> and provide commented, minimal, self-contained, reproducible code.
	[[alternative HTML version deleted]]