[Rd] Potential improvements of ave?

Tue Mar 16 10:50:10 CET 2021

>>>>> Gabriel Becker 
>>>>>     on Mon, 15 Mar 2021 15:08:44 -0700 writes:

    > Abby,
    > Vectors do have an internal mechanism for knowing that they are sorted, via
    > ALTREP (it was one of two core motivating features for 'smart vectors', the
    > other being knowledge about the presence of NAs).

    > Currently I don't think we expose it at the R level, though it is part of
    > the official C API. I don't know of any plans for this to change, but I
    > suppose it could. Plus for functions in R itself, we could even use it
    > without exposing it more widely. A number of functions, including sort
    > itself, already do this in fact, but more could. I'd be interested in
    > hearing which functions you think would particularly benefit from this.

Thank you Gabe.

    > ~G

I vaguely remember (from Luke's docs/presentation on ALTREP)
that there are some "missing parts" here.
One of them is the missing R-level functionality; another may be
the C code below R's  is.unsorted()  ... maybe  is.unsorted()
could get a new argument and/or be rewritten, moving the NA
handling to C as well, so that it happens *after* the C code checks
whether it's an ALTREP object and whether that "knows it's sorted".

Martin
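
For reference, a minimal sketch of how the R-level is.unsorted() treats NAs at present; the example vectors are made up for illustration. With the default na.rm = FALSE the result is NA as soon as the input contains an NA, which is the handling discussed above as a candidate for moving to C:

```r
# Plain numeric vectors; no ALTREP metadata is assumed here.
r_sorted   <- is.unsorted(c(1, 2, 3))                  # FALSE: already sorted
r_unsorted <- is.unsorted(c(3, 1, 2))                  # TRUE
r_na       <- is.unsorted(c(1, NA, 3))                 # NA: NAs present, na.rm = FALSE
r_na_rm    <- is.unsorted(c(1, NA, 3), na.rm = TRUE)   # FALSE: NAs dropped first
r_strict   <- is.unsorted(c(1, 1, 2), strictly = TRUE) # TRUE: ties break strict order
```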

    > On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <Thomas.SOEIRO using ap-hm.fr>
    > wrote:

    >> Hi Abby,
    >> 
    >> Thank you for your positive feedback.
    >> 
    >> I agree with your general comment about sorting.
    >> 
    >> For ave specifically, ordering may not help because the output must
    >> maintain the order of the input (as ave returns only x and not the entire
    >> data.frame).
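
A small illustration of the point about input order, using made-up data. ave() returns one value per element of x, in the original positions, so a sort-based approach would need an extra step to permute the results back:

```r
x <- c(10, 20, 30, 40)
g <- c("b", "a", "b", "a")

# Group means, returned in the order of the input.
direct <- ave(x, g)

# Sketch of a sort-based variant: compute on sorted data, then undo the sort.
o <- order(g)
on_sorted <- ave(x[o], g[o])
restored <- numeric(length(x))
restored[o] <- on_sorted   # permute back to the input order
```

Here direct and restored agree, but only because of the final permutation step.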
    >> 
    >> Thanks,
    >> 
    >> Thomas
    >> ________________________________________
    >> From: Abby Spurdle <spurdle.a using gmail.com>
    >> Sent: Monday, 15 March 2021 10:22
    >> To: SOEIRO Thomas
    >> Cc: r-devel using r-project.org
    >> Subject: Re: [Rd] Potential improvements of ave?
    >> 
    >> Hi Thomas,
    >> 
    >> These are some great suggestions.
    >> But I can't help but feel there's a much bigger problem here.
    >> 
    >> Intuitively, the ave function could (or should) sort the data.
    >> Then the indexing step becomes almost trivial, in terms of both time
    >> and space complexity.
    >> And the ave function is not the only example of where a problem
    >> becomes much simpler, if the data is sorted.
    >> 
    >> Historically, I've never found base R functions user-friendly for
    >> aggregation purposes, or for sorting.
    >> (At least, not by comparison to SQL).
    >> 
    >> But that's not the main problem.
    >> It would seem preferable to sort the data, only once.
    >> (Rather than sorting it repeatedly, or not at all).
    >> 
    >> Perhaps, objects such as vectors and data.frame(s) could have a
    >> boolean attribute, to indicate if they're sorted.
    >> Or functions such as ave could have a sorted argument.
    >> In either case, if true, the function assumes the data is sorted and
    >> applies a more efficient algorithm.
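
A rough sketch of what such a "sorted" fast path could look like, assuming the grouping vector is already sorted and FUN is the mean. group_means_sorted is a hypothetical helper written for this illustration, not an existing base R function:

```r
# Hypothetical helper: group means when g is known to be sorted.
# Group boundaries come from run lengths, so no split() is needed.
group_means_sorted <- function(x, g) {
  stopifnot(length(x) == length(g), !is.unsorted(g))
  runs <- rle(g)$lengths               # size of each consecutive group
  ends <- cumsum(runs)                 # index of the last element of each group
  sums <- diff(c(0, cumsum(x)[ends]))  # per-group sums from one cumulative sum
  rep(sums / runs, times = runs)       # expand back to one value per element
}

g <- c(1, 1, 2, 2, 2, 3)
x <- c(4, 6, 1, 2, 3, 10)

fast <- group_means_sorted(x, g)
slow <- ave(x, g)   # reference result from base R
```

This only covers the mean; an arbitrary FUN would still need the per-group slices, though these are contiguous ranges when the data is sorted.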
    >> 
    >> 
    >> B.
    >> 
    >> 
    >> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <Thomas.SOEIRO using ap-hm.fr>
    >> wrote:
    >> >
    >> > Dear all,
    >> >
    >> > I have two questions/suggestions about ave, but I am not sure if it's
    >> > relevant for bug reports.
    >> >
    >> >
    >> >
    >> > 1) I have performance issues with ave in a case where I didn't expect
    >> > it. The following code runs as expected:
    >> >
    >> > set.seed(1)
    >> >
    >> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
    >> >                   id2 = sample(1:3, 5e2, TRUE),
    >> >                   id3 = sample(1:5, 5e2, TRUE),
    >> >                   val = sample(1:300, 5e2, TRUE))
    >> >
    >> > df1$diff <- ave(df1$val,
    >> >                 df1$id1,
    >> >                 df1$id2,
    >> >                 df1$id3,
    >> >                 FUN = function(i) c(diff(i), 0))
    >> >
    >> > head(df1[order(df1$id1,
    >> >                df1$id2,
    >> >                df1$id3), ])
    >> >
    >> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot
    >> > allocate vector of size 1110.0 Gb):
    >> >
    >> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
    >> >                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
    >> >                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
    >> >                   val = sample(1:300, 5e2 * 1e4, TRUE))
    >> >
    >> > df2$diff <- ave(df2$val,
    >> >                 df2$id1,
    >> >                 df2$id2,
    >> >                 df2$id3,
    >> >                 FUN = function(i) c(diff(i), 0))
    >> >
    >> > This use case does not seem extreme to me (e.g. aggregate et al work
    >> > perfectly on this data.frame).
    >> > So my question is: Is this expected/intended/reasonable? i.e. Does ave
    >> > need to be optimized?
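
One possible explanation and workaround, sketched under the assumption that ave() builds its groups with interaction(...), so that every combination of levels becomes a group whether or not it occurs in the data. For df2 that is roughly 1e6 * 3 * 5e4 = 1.5e11 combinations, and a vector of that length at 8 bytes per element is on the order of the 1110 Gb in the error message. The helper ave_observed below is hypothetical (not part of base R): it pastes the grouping vectors into a single key so that only observed combinations become levels, and it assumes the separator "\r" never occurs in the grouping values.

```r
# interaction() creates all combinations of levels, observed or not.
n_all      <- nlevels(interaction(c(1, 2), c("a", "b")))    # 2 * 2 combinations
n_observed <- nlevels(factor(paste(c(1, 2), c("a", "b"))))  # only "1 a" and "2 b"

# Hypothetical helper: collapse the grouping vectors into one key first.
ave_observed <- function(x, ..., FUN = mean) {
  key <- do.call(paste, c(list(...), sep = "\r"))  # assumes no "\r" in the values
  ave(x, key, FUN = FUN)
}

# Check on small data that both approaches agree.
set.seed(1)
n   <- 500
id1 <- sample(1:100, n, TRUE)
id2 <- sample(1:3, n, TRUE)
id3 <- sample(1:5, n, TRUE)
val <- sample(1:300, n, TRUE)

f <- function(i) c(diff(i), 0)
res_ave <- ave(val, id1, id2, id3, FUN = f)
res_key <- ave_observed(val, id1, id2, id3, FUN = f)
```

The number of groups in ave_observed is bounded by the number of rows, so memory use no longer depends on the product of the numbers of levels.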
    >> >
    >> >
    >> >
    >> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to
    >> > avoid warnings in case of unused levels
    >> > (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html).
    >> > Is it relevant/possible to expose the drop argument explicitly?
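
Without an explicit drop argument, a similar effect can be had by dropping unused levels before calling ave(). A minimal sketch with made-up data, assuming a FUN such as max that warns when given an empty group; whether the first call actually warns depends on how the installed version of ave() treats unused levels:

```r
f <- factor(c("a", "a", "b"), levels = c("a", "b", "c"))  # level "c" is unused
x <- c(3, 7, 5)

# May warn, because max() can be called on the empty group for level "c";
# the warning is suppressed here so the example runs quietly.
res_all <- suppressWarnings(ave(x, f, FUN = max))

# Dropping unused levels first leaves only the observed groups.
res_drop <- ave(x, droplevels(f), FUN = max)
```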
    >> >
    >> >
    >> >
    >> > Thanks,
    >> >
    >> > Thomas
    >> > ______________________________________________
    >> > R-devel using r-project.org mailing list
    >> > https://stat.ethz.ch/mailman/listinfo/r-devel
    >> 
