[Rd] Potential improvements of ave?
Martin Maechler
m@ech|er @end|ng |rom @t@t@m@th@ethz@ch
Tue Mar 16 10:50:10 CET 2021
>>>>> Gabriel Becker
>>>>> on Mon, 15 Mar 2021 15:08:44 -0700 writes:
> Abby,
> Vectors do have an internal mechanism for knowing that they are sorted via
> ALTREP (it was one of 2 core motivating features for 'smart vectors' the
> other being knowledge about presence of NAs).
> Currently I don't think we expose it at the R level, though it is part of
> the official C API. I don't know of any plans for this to change, but I
> suppose it could. Plus for functions in R itself, we could even use it
> without exposing it more widely. A number of functions, including sort
> itself, already do this in fact, but more could. I'd be interested in
> hearing which functions you think would particularly benefit from this.
Thank you Gabe.
> ~G
I vaguely remember (from Luke's docs/presentation on ALTREP)
that there are some "missing parts" here.
One of them the not-existing R level functionality, another may be
the C code below R's is.unsorted() ... maybe is.unsorted()
could get a new argument and or be re-written, moving the NA
handling also to C and have that happen *after* the C code looks
if it's an ALTREP object and if that "knows it's sorted".
Martin
> On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <Thomas.SOEIRO using ap-hm.fr>
> wrote:
>> Hi Abby,
>>
>> Thank you for your positive feedback.
>>
>> I agree for your general comment about sorting.
>>
>> For ave specifically, ordering may not help because the output must
>> maintain the order of the input (as ave returns only x and not the entiere
>> data.frame).
>>
>> Thanks,
>>
>> Thomas
>> ________________________________________
>> De : Abby Spurdle <spurdle.a using gmail.com>
>> Envoyé : lundi 15 mars 2021 10:22
>> À : SOEIRO Thomas
>> Cc : r-devel using r-project.org
>> Objet : Re: [Rd] Potential improvements of ave?
>>
>> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS
>>
>> Hi Thomas,
>>
>> These are some great suggestions.
>> But I can't help but feel there's a much bigger problem here.
>>
>> Intuitively, the ave function could (or should) sort the data.
>> Then the indexing step becomes almost trivial, in terms of both time
>> and space complexity.
>> And the ave function is not the only example of where a problem
>> becomes much simpler, if the data is sorted.
>>
>> Historically, I've never found base R functions user-friendly for
>> aggregation purposes, or for sorting.
>> (At least, not by comparison to SQL).
>>
>> But that's not the main problem.
>> It would seem preferable to sort the data, only once.
>> (Rather than sorting it repeatedly, or not at all).
>>
>> Perhaps, objects such as vectors and data.frame(s) could have a
>> boolean attribute, to indicate if they're sorted.
>> Or functions such as ave could have a sorted argument.
>> In either case, if true, the function assumes the data is sorted and
>> applies a more efficient algorithm.
>>
>>
>> B.
>>
>>
>> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <Thomas.SOEIRO using ap-hm.fr>
>> wrote:
>> >
>> > Dear all,
>> >
>> > I have two questions/suggestions about ave, but I am not sure if it's
>> relevant for bug reports.
>> >
>> >
>> >
>> > 1) I have performance issues with ave in a case where I didn't expect
>> it. The following code runs as expected:
>> >
>> > set.seed(1)
>> >
>> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
>> > id2 = sample(1:3, 5e2, TRUE),
>> > id3 = sample(1:5, 5e2, TRUE),
>> > val = sample(1:300, 5e2, TRUE))
>> >
>> > df1$diff <- ave(df1$val,
>> > df1$id1,
>> > df1$id2,
>> > df1$id3,
>> > FUN = function(i) c(diff(i), 0))
>> >
>> > head(df1[order(df1$id1,
>> > df1$id2,
>> > df1$id3), ])
>> >
>> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot
>> allocate vector of size 1110.0 Gb):
>> >
>> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
>> > id2 = sample(1:3, 5e2 * 1e4, TRUE),
>> > id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
>> > val = sample(1:300, 5e2 * 1e4, TRUE))
>> >
>> > df2$diff <- ave(df2$val,
>> > df2$id1,
>> > df2$id2,
>> > df2$id3,
>> > FUN = function(i) c(diff(i), 0))
>> >
>> > This use case does not seem extreme to me (e.g. aggregate et al work
>> perfectly on this data.frame).
>> > So my question is: Is this expected/intended/reasonable? i.e. Does ave
>> need to be optimized?
>> >
>> >
>> >
>> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to
>> avoid warnings in case of unused levels (
>> https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$
>> ).
>> > Is it relevant/possible to expose the drop argument explicitly?
>> >
>> >
>> >
>> > Thanks,
>> >
>> > Thomas
>> > ______________________________________________
>> > R-devel using r-project.org mailing list
>> >
>> https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$
>>
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
> [[alternative HTML version deleted]]
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
More information about the R-devel
mailing list