[Rd] Potential improvements of ave?

Gabriel Becker g@bembecker @end|ng |rom gm@||@com
Wed Mar 17 00:06:42 CET 2021


Hi Abby,

I actually have a patch submitted that does this for unique/duplicated
(only numeric cases I think) but it is, as patches from external
contributors go, quite sizable which means it requires a correspondingly
large amount of an R-core member's time and energy to vet and consider. It
is in the queue, and so, I expect (/hope, provided I didn't make a mistake)
it will be incorporated at some point. (
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17993)

You are correct that the speedups are quite significant for calling
unique/duplicated on large vectors that know they are sorted: Speedup on my
machine for a fairly sizable vector (length 1e7) ranges from about ~10x in
the densely duplicated case up to ~60-70x in the sparsely duplicated case
for duplicated(). For unique() it seems to range from ~10x in the densely
duplicated case to ~15 in the spare case.

I had thought that min and max already did this, but looking now, they
don't seem to by default, thought ALTREP classes themselves do have an
option of setting a min/max method, which would be hit. That does seem like
low-hanging fruit, I agree, though in many cases the slow down from a
single pass over the data to get a min probably isn't earthshattering.

The others do seem like they could benefit as well.

Best,
~G

On Tue, Mar 16, 2021 at 2:54 PM Abby Spurdle <spurdle.a using gmail.com> wrote:

> There are some relatively obvious examples:
> unique, which.min/which.max/etc, range/min/max, quantile, aggregate/split
>
> Also, many timeseries, graphics and spline functions are dependent on the
> order.
>
> In the case of data.frame(s), a boolean flag would probably need to be
> extended to allow for multiple column sorting, and
> ascending/descending options.
>
> On Tue, Mar 16, 2021 at 11:08 AM Gabriel Becker <gabembecker using gmail.com>
> wrote:
> >
> > Abby,
> >
> > Vectors do have an internal mechanism for knowing that they are sorted
> via ALTREP (it was one of 2 core motivating features for 'smart vectors'
> the other being knowledge about presence of NAs).
> >
> > Currently I don't think we expose it at the R level, though it is part
> of the official C API. I don't know of any plans for this to change, but I
> suppose it could. Plus for functions in R itself, we could even use it
> without exposing it more widely. A number of functions, including sort
> itself, already do this in fact, but more could. I'd be interested in
> hearing which functions you think would particularly benefit from this.
> >
> > ~G
> >
> > On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <Thomas.SOEIRO using ap-hm.fr>
> wrote:
> >>
> >> Hi Abby,
> >>
> >> Thank you for your positive feedback.
> >>
> >> I agree for your general comment about sorting.
> >>
> >> For ave specifically, ordering may not help because the output must
> maintain the order of the input (as ave returns only x and not the entiere
> data.frame).
> >>
> >> Thanks,
> >>
> >> Thomas
> >> ________________________________________
> >> De : Abby Spurdle <spurdle.a using gmail.com>
> >> Envoyé : lundi 15 mars 2021 10:22
> >> À : SOEIRO Thomas
> >> Cc : r-devel using r-project.org
> >> Objet : Re: [Rd] Potential improvements of ave?
> >>
> >> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS
> >>
> >> Hi Thomas,
> >>
> >> These are some great suggestions.
> >> But I can't help but feel there's a much bigger problem here.
> >>
> >> Intuitively, the ave function could (or should) sort the data.
> >> Then the indexing step becomes almost trivial, in terms of both time
> >> and space complexity.
> >> And the ave function is not the only example of where a problem
> >> becomes much simpler, if the data is sorted.
> >>
> >> Historically, I've never found base R functions user-friendly for
> >> aggregation purposes, or for sorting.
> >> (At least, not by comparison to SQL).
> >>
> >> But that's not the main problem.
> >> It would seem preferable to sort the data, only once.
> >> (Rather than sorting it repeatedly, or not at all).
> >>
> >> Perhaps, objects such as vectors and data.frame(s) could have a
> >> boolean attribute, to indicate if they're sorted.
> >> Or functions such as ave could have a sorted argument.
> >> In either case, if true, the function assumes the data is sorted and
> >> applies a more efficient algorithm.
> >>
> >>
> >> B.
> >>
> >>
> >> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <Thomas.SOEIRO using ap-hm.fr>
> wrote:
> >> >
> >> > Dear all,
> >> >
> >> > I have two questions/suggestions about ave, but I am not sure if it's
> relevant for bug reports.
> >> >
> >> >
> >> >
> >> > 1) I have performance issues with ave in a case where I didn't expect
> it. The following code runs as expected:
> >> >
> >> > set.seed(1)
> >> >
> >> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
> >> >                   id2 = sample(1:3, 5e2, TRUE),
> >> >                   id3 = sample(1:5, 5e2, TRUE),
> >> >                   val = sample(1:300, 5e2, TRUE))
> >> >
> >> > df1$diff <- ave(df1$val,
> >> >                 df1$id1,
> >> >                 df1$id2,
> >> >                 df1$id3,
> >> >                 FUN = function(i) c(diff(i), 0))
> >> >
> >> > head(df1[order(df1$id1,
> >> >                df1$id2,
> >> >                df1$id3), ])
> >> >
> >> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot
> allocate vector of size 1110.0 Gb):
> >> >
> >> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
> >> >                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
> >> >                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
> >> >                   val = sample(1:300, 5e2 * 1e4, TRUE))
> >> >
> >> > df2$diff <- ave(df2$val,
> >> >                 df2$id1,
> >> >                 df2$id2,
> >> >                 df2$id3,
> >> >                 FUN = function(i) c(diff(i), 0))
> >> >
> >> > This use case does not seem extreme to me (e.g. aggregate et al work
> perfectly on this data.frame).
> >> > So my question is: Is this expected/intended/reasonable? i.e. Does
> ave need to be optimized?
> >> >
> >> >
> >> >
> >> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed
> to avoid warnings in case of unused levels (
> https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$
> ).
> >> > Is it relevant/possible to expose the drop argument explicitly?
> >> >
> >> >
> >> >
> >> > Thanks,
> >> >
> >> > Thomas
> >> > ______________________________________________
> >> > R-devel using r-project.org mailing list
> >> >
> https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$
> >>
> >> ______________________________________________
> >> R-devel using r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>

	[[alternative HTML version deleted]]



More information about the R-devel mailing list