[Rd] Potential improvements of ave?

Gabriel Becker g@bembecker @end|ng |rom gm@||@com
Mon Mar 15 23:08:44 CET 2021


Abby,

Vectors do have an internal mechanism for knowing that they are sorted via
ALTREP (it was one of 2 core motivating features for 'smart vectors' the
other being knowledge about presence of NAs).

Currently I don't think we expose it at the R level, though it is part of
the official C API. I don't know of any plans for this to change, but I
suppose it could. Plus for functions in R itself, we could even use it
without exposing it more widely. A number of functions, including sort
itself, already do this in fact, but more could. I'd be interested in
hearing which functions you think would particularly benefit from this.

~G

On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <Thomas.SOEIRO using ap-hm.fr>
wrote:

> Hi Abby,
>
> Thank you for your positive feedback.
>
> I agree for your general comment about sorting.
>
> For ave specifically, ordering may not help because the output must
> maintain the order of the input (as ave returns only x and not the entiere
> data.frame).
>
> Thanks,
>
> Thomas
> ________________________________________
> De : Abby Spurdle <spurdle.a using gmail.com>
> Envoyé : lundi 15 mars 2021 10:22
> À : SOEIRO Thomas
> Cc : r-devel using r-project.org
> Objet : Re: [Rd] Potential improvements of ave?
>
> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS
>
> Hi Thomas,
>
> These are some great suggestions.
> But I can't help but feel there's a much bigger problem here.
>
> Intuitively, the ave function could (or should) sort the data.
> Then the indexing step becomes almost trivial, in terms of both time
> and space complexity.
> And the ave function is not the only example of where a problem
> becomes much simpler, if the data is sorted.
>
> Historically, I've never found base R functions user-friendly for
> aggregation purposes, or for sorting.
> (At least, not by comparison to SQL).
>
> But that's not the main problem.
> It would seem preferable to sort the data, only once.
> (Rather than sorting it repeatedly, or not at all).
>
> Perhaps, objects such as vectors and data.frame(s) could have a
> boolean attribute, to indicate if they're sorted.
> Or functions such as ave could have a sorted argument.
> In either case, if true, the function assumes the data is sorted and
> applies a more efficient algorithm.
>
>
> B.
>
>
> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <Thomas.SOEIRO using ap-hm.fr>
> wrote:
> >
> > Dear all,
> >
> > I have two questions/suggestions about ave, but I am not sure if it's
> relevant for bug reports.
> >
> >
> >
> > 1) I have performance issues with ave in a case where I didn't expect
> it. The following code runs as expected:
> >
> > set.seed(1)
> >
> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
> >                   id2 = sample(1:3, 5e2, TRUE),
> >                   id3 = sample(1:5, 5e2, TRUE),
> >                   val = sample(1:300, 5e2, TRUE))
> >
> > df1$diff <- ave(df1$val,
> >                 df1$id1,
> >                 df1$id2,
> >                 df1$id3,
> >                 FUN = function(i) c(diff(i), 0))
> >
> > head(df1[order(df1$id1,
> >                df1$id2,
> >                df1$id3), ])
> >
> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot
> allocate vector of size 1110.0 Gb):
> >
> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
> >                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
> >                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
> >                   val = sample(1:300, 5e2 * 1e4, TRUE))
> >
> > df2$diff <- ave(df2$val,
> >                 df2$id1,
> >                 df2$id2,
> >                 df2$id3,
> >                 FUN = function(i) c(diff(i), 0))
> >
> > This use case does not seem extreme to me (e.g. aggregate et al work
> perfectly on this data.frame).
> > So my question is: Is this expected/intended/reasonable? i.e. Does ave
> need to be optimized?
> >
> >
> >
> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to
> avoid warnings in case of unused levels (
> https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$
> ).
> > Is it relevant/possible to expose the drop argument explicitly?
> >
> >
> >
> > Thanks,
> >
> > Thomas
> > ______________________________________________
> > R-devel using r-project.org mailing list
> >
> https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

	[[alternative HTML version deleted]]



More information about the R-devel mailing list