[R] aggregate formula - differing results
Bert Gunter
bgunter@4567 @end|ng |rom gm@||@com
Mon Sep 4 16:32:20 CEST 2023
Ivan:
Just one perhaps extraneous comment.
You said that you were surprised that aggregate() and group_by() did not
have the same behavior. That is a misconception on your part. As you know,
the tidyverse recapitulates the functionality of many base R functions; but
it makes no claims to do so in exactly the same way and, indeed, often
makes deliberate changes to "improve" behavior. So if you wish to use both,
you should *expect* such differences, which, of course, are documented in
the man pages (and often elsewhere).
Cheers,
Bert
On Mon, Sep 4, 2023 at 5:21 AM Ivan Calandra <ivan.calandra using leiza.de> wrote:
> Haha, got it now, there is an na.action argument (which defaults to
> na.omit) to aggregate() which is applied before calling mean(na.rm =
> TRUE). Thank you Rui for pointing this out.
>
> So running it with na.pass instead of na.omit gives the same results as
> dplyr::group_by()+summarise():
> aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE,
> na.action = na.pass)
>
> Cheers,
> Ivan
>
> On 04/09/2023 13:56, Rui Barradas wrote:
> > Às 12:51 de 04/09/2023, Ivan Calandra escreveu:
> >> Thanks Rui for your help; that would be one possibility indeed.
> >>
> >> But am I the only one who finds that behavior of aggregate()
> >> completely unexpected and confusing? Especially considering that
> >> dplyr::summarise() and doBy::summaryBy() deal with NAs differently,
> >> even though they all use mean(na.rm = TRUE) to calculate the group
> >> stats.
> >>
> >> Best wishes,
> >> Ivan
> >>
> >> On 04/09/2023 13:46, Rui Barradas wrote:
> >>> Às 10:44 de 04/09/2023, Ivan Calandra escreveu:
> >>>> Dear useRs,
> >>>>
> >>>> I have just stumbled across a behavior in aggregate() that I cannot
> >>>> explain. Any help would be appreciated!
> >>>>
> >>>> Sample data:
> >>>> my_data <- structure(list(ID = c("FLINT-1", "FLINT-10",
> >>>> "FLINT-100", "FLINT-101", "FLINT-102", "HORN-10", "HORN-100",
> >>>> "HORN-102", "HORN-103", "HORN-104"), EdgeLength = c(130.75, 168.77,
> >>>> 142.79, 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5),
> >>>> SurfaceArea = c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47,
> >>>> 1169.26, 444.61, 1791.48, 461.15, 1127.2), Length = c(44.384,
> >>>> 29.831, 43.869, 48.011, 54.109, 41.742, 23.854, 32.075, 21.337,
> >>>> 35.459), Width = c(45.982, 67.303, 52.679, 26.42, 25.149, 33.427,
> >>>> 20.683, 62.783, 26.417, 35.297), PLATWIDTH = c(38.84, NA, 15.33,
> >>>> 30.37, 11.44, 14.88, 13.86, NA, NA, 26.71), PLATTHICK = c(8.67, NA,
> >>>> 7.99, 11.69, 3.3, 16.52, 4.58, NA, NA, 9.35), EPA = c(78, NA, 78,
> >>>> 54, 72, 49, 56, NA, NA, 56), THICKNESS = c(10.97, NA, 9.36, 6.4,
> >>>> 5.89, 11.05, 4.9, NA, NA, 10.08), WEIGHT = c(34.3, NA, 25.5, 18.6,
> >>>> 14.9, 29.5, 4.5, NA, NA, 23), RAWMAT = c("FLINT", "FLINT", "FLINT",
> >>>> "FLINT", "FLINT", "HORNFELS", "HORNFELS", "HORNFELS", "HORNFELS",
> >>>> "HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L, 111L, 112L, 113L,
> >>>> 114L, 115L), class = "data.frame")
> >>>>
> >>>> 1) Simple aggregation with 2 variables:
> >>>> aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data, FUN =
> >>>> mean, na.rm = TRUE)
> >>>>
> >>>> 2) Using the dot notation - different results:
> >>>> aggregate(. ~ RAWMAT, data = my_data[-1], FUN = mean, na.rm = TRUE)
> >>>>
> >>>> 3) Using dplyr, I get the same results as #1:
> >>>> group_by(my_data, RAWMAT) %>%
> >>>> summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
> >>>>
> >>>> 4) It gets weirder: using all columns in #1 give the same results
> >>>> as in #2 but different from #1 and #3
> >>>> aggregate(cbind(EdgeLength, SurfaceArea, Length, Width, PLATWIDTH,
> >>>> PLATTHICK, EPA, THICKNESS, WEIGHT) ~ RAWMAT, data = my_data, FUN =
> >>>> mean, na.rm = TRUE)
> >>>>
> >>>> So it seems it is not only due to the notation (cbind() vs. dot).
> >>>> Is it a bug? A peculiar thing in my dataset? I tend to think this
> >>>> could be due to some variables (or their names) as all notations
> >>>> seem to agree when I remove some variables (although I haven't
> >>>> found out which variable(s) is (are) at fault), e.g.:
> >>>>
> >>>> my_data2 <- structure(list(ID = c("FLINT-1", "FLINT-10",
> >>>> "FLINT-100", "FLINT-101", "FLINT-102", "HORN-10", "HORN-100",
> >>>> "HORN-102", "HORN-103", "HORN-104"), EdgeLength = c(130.75, 168.77,
> >>>> 142.79, 130.1, 140.41, 121.37, 70.52, 122.3, 71.01, 104.5),
> >>>> SurfaceArea = c(1736.87, 1571.83, 1656.46, 1247.18, 1177.47,
> >>>> 1169.26, 444.61, 1791.48, 461.15, 1127.2), Length = c(44.384,
> >>>> 29.831, 43.869, 48.011, 54.109, 41.742, 23.854, 32.075, 21.337,
> >>>> 35.459), Width = c(45.982, 67.303, 52.679, 26.42, 25.149, 33.427,
> >>>> 20.683, 62.783, 26.417, 35.297), RAWMAT = c("FLINT", "FLINT",
> >>>> "FLINT", "FLINT", "FLINT", "HORNFELS", "HORNFELS", "HORNFELS",
> >>>> "HORNFELS", "HORNFELS")), row.names = c(1L, 2L, 3L, 4L, 5L, 111L,
> >>>> 112L, 113L, 114L, 115L), class = "data.frame")
> >>>>
> >>>> aggregate(cbind(EdgeLength, SurfaceArea, Length, Width) ~ RAWMAT,
> >>>> data = my_data2, FUN = mean, na.rm = TRUE)
> >>>>
> >>>> aggregate(. ~ RAWMAT, data = my_data2[-1], FUN = mean, na.rm = TRUE)
> >>>>
> >>>> group_by(my_data2, RAWMAT) %>%
> >>>> summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
> >>>>
> >>>>
> >>>> Thank you in advance for any hint.
> >>>> Best wishes,
> >>>> Ivan
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> *LEIBNIZ-ZENTRUM*
> >>>> *FÜR ARCHÄOLOGIE*
> >>>>
> >>>> *Dr. Ivan CALANDRA*
> >>>> **Head of IMPALA (IMaging Platform At LeizA)
> >>>>
> >>>> *MONREPOS* Archaeological Research Centre, Schloss Monrepos
> >>>> 56567 Neuwied, Germany
> >>>>
> >>>> T: +49 2631 9772 243
> >>>> T: +49 6131 8885 543
> >>>> ivan.calandra using leiza.de
> >>>>
> >>>> leiza.de <http://www.leiza.de/>
> >>>> <http://www.leiza.de/>
> >>>> ORCID <https://orcid.org/0000-0003-3816-6359>
> >>>> ResearchGate
> >>>> <https://www.researchgate.net/profile/Ivan_Calandra>
> >>>>
> >>>> LEIZA is a foundation under public law of the State of
> >>>> Rhineland-Palatinate and the City of Mainz. Its headquarters are in
> >>>> Mainz. Supervision is carried out by the Ministry of Science and
> >>>> Health of the State of Rhineland-Palatinate. LEIZA is a research
> >>>> museum of the Leibniz Association.
> >>>> ______________________________________________
> >>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>> PLEASE do read the posting guide
> >>>> http://www.R-project.org/posting-guide.html
> >>>> and provide commented, minimal, self-contained, reproducible code.
> >>> Hello,
> >>>
> >>> You can define a vector of the columns of interest and subset the
> >>> data with it. Then the default na.action = na.omit will no longer
> >>> remove the rows with NA vals in at least one column and the results
> >>> are the same.
> >>>
> >>> However, this will not give the mean values of the other numeric
> >>> columns, just of those two.
> >>>
> >>>
> >>>
> >>> # define a vector of columns of interest
> >>> cols <- c("Length", "Width", "RAWMAT")
> >>>
> >>> # 1) Simple aggregation with 2 variables, select cols:
> >>> aggregate(cbind(Length, Width) ~ RAWMAT, data = my_data[cols], FUN =
> >>> mean, na.rm = TRUE)
> >>>
> >>> # 2) Using the dot notation - if cols are selected, equal results:
> >>> aggregate(. ~ RAWMAT, data = my_data[cols], FUN = mean, na.rm = TRUE)
> >>>
> >>> # 3) Using dplyr, the results are now the same results as #1 and #2:
> >>> my_data %>%
> >>> select(all_of(cols)) %>%
> >>> group_by(RAWMAT) %>%
> >>> summarise(across(c("Length", "Width"), ~ mean(.x, na.rm = TRUE)))
> >>>
> >>>
> >>> Hope this helps,
> >>>
> >>> Rui Barradas
> >>>
> > Hello,
> >
> > Puzzling at first yes, unexpected no, it's documented behavior.
> >
> > This is the result of how the aggregate works, by first applying
> > na.action to the data and only then applying the function to each
> > column in the formula's LHS.
> >
> > dplyr works column by column, so there is no na.action involved.
> >
> > Hope this helps,
> >
> > Rui Barradas
> >
> >
> >
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
[[alternative HTML version deleted]]
More information about the R-help
mailing list