[Rd] Potential improvements of ave?

Abby Spurdle @purd|e@@ @end|ng |rom gm@||@com
Tue Mar 16 22:53:46 CET 2021


There are some relatively obvious examples:
unique, which.min/which.max/etc, range/min/max, quantile, aggregate/split

Also, many timeseries, graphics and spline functions are dependent on the order.

In the case of data.frame(s), a boolean flag would probably need to be
extended to allow for multiple column sorting, and
ascending/descending options.

On Tue, Mar 16, 2021 at 11:08 AM Gabriel Becker <gabembecker using gmail.com> wrote:
>
> Abby,
>
> Vectors do have an internal mechanism for knowing that they are sorted via ALTREP (it was one of 2 core motivating features for 'smart vectors' the other being knowledge about presence of NAs).
>
> Currently I don't think we expose it at the R level, though it is part of the official C API. I don't know of any plans for this to change, but I suppose it could. Plus for functions in R itself, we could even use it without exposing it more widely. A number of functions, including sort itself, already do this in fact, but more could. I'd be interested in hearing which functions you think would particularly benefit from this.
>
> ~G
>
> On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <Thomas.SOEIRO using ap-hm.fr> wrote:
>>
>> Hi Abby,
>>
>> Thank you for your positive feedback.
>>
>> I agree for your general comment about sorting.
>>
>> For ave specifically, ordering may not help because the output must maintain the order of the input (as ave returns only x and not the entiere data.frame).
>>
>> Thanks,
>>
>> Thomas
>> ________________________________________
>> De : Abby Spurdle <spurdle.a using gmail.com>
>> Envoyé : lundi 15 mars 2021 10:22
>> À : SOEIRO Thomas
>> Cc : r-devel using r-project.org
>> Objet : Re: [Rd] Potential improvements of ave?
>>
>> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS
>>
>> Hi Thomas,
>>
>> These are some great suggestions.
>> But I can't help but feel there's a much bigger problem here.
>>
>> Intuitively, the ave function could (or should) sort the data.
>> Then the indexing step becomes almost trivial, in terms of both time
>> and space complexity.
>> And the ave function is not the only example of where a problem
>> becomes much simpler, if the data is sorted.
>>
>> Historically, I've never found base R functions user-friendly for
>> aggregation purposes, or for sorting.
>> (At least, not by comparison to SQL).
>>
>> But that's not the main problem.
>> It would seem preferable to sort the data, only once.
>> (Rather than sorting it repeatedly, or not at all).
>>
>> Perhaps, objects such as vectors and data.frame(s) could have a
>> boolean attribute, to indicate if they're sorted.
>> Or functions such as ave could have a sorted argument.
>> In either case, if true, the function assumes the data is sorted and
>> applies a more efficient algorithm.
>>
>>
>> B.
>>
>>
>> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <Thomas.SOEIRO using ap-hm.fr> wrote:
>> >
>> > Dear all,
>> >
>> > I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports.
>> >
>> >
>> >
>> > 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected:
>> >
>> > set.seed(1)
>> >
>> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
>> >                   id2 = sample(1:3, 5e2, TRUE),
>> >                   id3 = sample(1:5, 5e2, TRUE),
>> >                   val = sample(1:300, 5e2, TRUE))
>> >
>> > df1$diff <- ave(df1$val,
>> >                 df1$id1,
>> >                 df1$id2,
>> >                 df1$id3,
>> >                 FUN = function(i) c(diff(i), 0))
>> >
>> > head(df1[order(df1$id1,
>> >                df1$id2,
>> >                df1$id3), ])
>> >
>> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb):
>> >
>> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
>> >                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
>> >                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
>> >                   val = sample(1:300, 5e2 * 1e4, TRUE))
>> >
>> > df2$diff <- ave(df2$val,
>> >                 df2$id1,
>> >                 df2$id2,
>> >                 df2$id3,
>> >                 FUN = function(i) c(diff(i), 0))
>> >
>> > This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame).
>> > So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized?
>> >
>> >
>> >
>> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$ ).
>> > Is it relevant/possible to expose the drop argument explicitly?
>> >
>> >
>> >
>> > Thanks,
>> >
>> > Thomas
>> > ______________________________________________
>> > R-devel using r-project.org mailing list
>> > https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$
>>
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel



More information about the R-devel mailing list