[Rd] Potential improvements of ave?

Bill Dunlap
Wed Mar 17 00:13:24 CET 2021


Your proposed change (roughly, replacing interaction() by
unique(paste())) slows down ave() considerably when there are long
columns with lots of repeated rows.
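
A rough way to check this on a given machine is to time only the key-building step both ways. This is a sketch, not the code from the proposed patch: the vector names and sizes are made up for illustration, and no timings are quoted because they depend on the machine and the R version. It also confirms that the two keys define the same groups.

```r
set.seed(1)
n <- 1e5                           # illustrative size; increase to see the effect
id1 <- sample(1:3, n, TRUE)        # long columns, few distinct rows
id2 <- sample(1:5, n, TRUE)

# Grouping key as a factor, dropping unused level combinations
t_int <- system.time(g_int <- interaction(id1, id2, drop = TRUE))

# Grouping key built from pasted strings, then compressed to integer ids
t_paste <- system.time({
    key <- paste(id1, id2, sep = ".")
    g_paste <- match(key, unique(key))
})

# Both keys induce the same partition of the rows
same_partition <- identical(as.character(g_int), key)

print(c(interaction = t_int[["elapsed"]], paste = t_paste[["elapsed"]]))
```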

I think that interaction(drop=TRUE, ...) can be changed to use less
memory and be faster by making a separate branch for drop=TRUE that
uses the following idiom for finding the unique rows in a data.frame:

new.duplicated.data.frame <- function (x, incomparables = FALSE,
                                       fromLast = FALSE, ...)
{
    # all entries considered duplicated until proven otherwise
    dup <- !logical(nrow(x))
    for(column in x) {
        dup <- dup & duplicated(column, incomparables = incomparables,
                                fromLast = fromLast)
    }
    dup
}

ave() could use the above directly or it could call interaction(drop=TRUE,...).
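
One caveat on the column-wise idiom: AND-ing the per-column duplicated() results flags a row whenever each of its values has already appeared in its own column, which is necessary but not sufficient for the whole row to be a repeat. With rows (1, "a"), (2, "b"), (1, "b"), the third row is flagged although it is new. A minimal sketch of a variant that tracks row identity by refining an integer group id one column at a time, and of how ave() might use it, could look like the following. The names row_group_ids() and ave_sketch() are hypothetical, not existing or proposed base R functions, and the code has not been benchmarked.

```r
# Hypothetical helper: one integer group id per row, refined column by column.
row_group_ids <- function(x) {
    id <- rep.int(1L, nrow(x))
    for (column in x) {
        code <- match(column, unique(column))   # per-column integer code
        # Map each (id, code) pair to a single number. This is computed in
        # doubles, so it is exact only while nrow(x)^2 stays below 2^53.
        pair <- (id - 1) * max(code, 1L) + code
        id <- match(pair, unique(pair))         # re-compress to 1..k
    }
    id
}

# Hypothetical ave() variant that groups on those ids instead of interaction().
ave_sketch <- function(x, ..., FUN = mean) {
    if (missing(...)) {
        x[] <- FUN(x)
    } else {
        g <- row_group_ids(data.frame(...))
        split(x, g) <- lapply(split(x, g), FUN)
    }
    x
}

# Counterexample for the column-wise idiom
x <- data.frame(a = c(1, 2, 1), b = c("a", "b", "b"))
naive <- duplicated(x$a) & duplicated(x$b)   # FALSE FALSE TRUE
by_id <- duplicated(row_group_ids(x))        # FALSE FALSE FALSE

# Small comparison against base ave() on made-up data
set.seed(1)
df1 <- data.frame(id1 = sample(1:100, 500, TRUE),
                  id2 = sample(1:3, 500, TRUE),
                  val = sample(1:300, 500, TRUE))
f <- function(i) c(diff(i), 0)
res_sketch <- ave_sketch(df1$val, df1$id1, df1$id2, FUN = f)
res_base   <- ave(df1$val, df1$id1, df1$id2, FUN = f)
```

The ids only grow to the number of distinct rows actually present, so no table of all level combinations is ever built, which is the memory cost that interaction() without drop=TRUE pays.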

On Tue, Mar 16, 2021 at 3:50 PM SOEIRO Thomas <Thomas.SOEIRO using ap-hm.fr> wrote:
>
> Dear all,
>
> Thank you for your consideration on this topic.
>
> I do not have enough knowledge of R internals to join the discussion about sorting mechanisms. In fact, I did not get how ordering could help for ave as the output must maintain the order of the input (because ave returns only x and not the entire data.frame).
>
> However, while the proposed workaround (i.e. paste0 instead of interaction, cf. https://stat.ethz.ch/pipermail/r-devel/2021-March/080509.html) does not solve the "bigger problem" of sorting, it is usable as is and solves the issue. Therefore, what do you think about it? (i.e., is it relevant for a patch?)
>
> Thanks,
>
> Thomas
>
>
> > ________________________________________
> > From: Abby Spurdle <spurdle.a using gmail.com>
> > Sent: Monday, 15 March 2021 10:22
> > To: SOEIRO Thomas
> > Cc: r-devel using r-project.org
> > Subject: Re: [Rd] Potential improvements of ave?
> >
> > Hi Thomas,
> >
> > These are some great suggestions.
> > But I can't help but feel there's a much bigger problem here.
> >
> > Intuitively, the ave function could (or should) sort the data.
> > Then the indexing step becomes almost trivial, in terms of both time
> > and space complexity.
> > And the ave function is not the only example of where a problem
> > becomes much simpler, if the data is sorted.
> >
> > Historically, I've never found base R functions user-friendly for
> > aggregation purposes, or for sorting.
> > (At least, not by comparison to SQL).
> >
> > But that's not the main problem.
> > It would seem preferable to sort the data, only once.
> > (Rather than sorting it repeatedly, or not at all).
> >
> > Perhaps, objects such as vectors and data.frame(s) could have a
> > boolean attribute, to indicate if they're sorted.
> > Or functions such as ave could have a sorted argument.
> > In either case, if true, the function assumes the data is sorted and
> > applies a more efficient algorithm.
> >
> >
> > B.
> >
> >
> > On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <Thomas.SOEIRO using ap-hm.fr> wrote:
> >>
> >> Dear all,
> >>
> >> I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports.
> >>
> >>
> >>
> >> 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected:
> >>
> >> set.seed(1)
> >>
> >> df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
> >>                   id2 = sample(1:3, 5e2, TRUE),
> >>                   id3 = sample(1:5, 5e2, TRUE),
> >>                   val = sample(1:300, 5e2, TRUE))
> >>
> >> df1$diff <- ave(df1$val,
> >>                 df1$id1,
> >>                 df1$id2,
> >>                 df1$id3,
> >>                 FUN = function(i) c(diff(i), 0))
> >>
> >> head(df1[order(df1$id1,
> >>                df1$id2,
> >>                df1$id3), ])
> >>
> >> But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb):
> >>
> >> df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
> >>                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
> >>                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
> >>                   val = sample(1:300, 5e2 * 1e4, TRUE))
> >>
> >> df2$diff <- ave(df2$val,
> >>                 df2$id1,
> >>                 df2$id2,
> >>                 df2$id3,
> >>                 FUN = function(i) c(diff(i), 0))
> >>
> >> This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame).
> >> So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized?
> >>
> >>
> >>
> >> 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html).
> >> Is it relevant/possible to expose the drop argument explicitly?
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Thomas
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


