[R] Improvement: function cut

Bert Gunter
Sat Sep 18 00:57:27 CEST 2021


Perhaps you and Andrew should take this discussion off list...

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Fri, Sep 17, 2021 at 3:45 PM Leonard Mada via R-help
<r-help using r-project.org> wrote:
>
> Why would you want to merge different factors?
>
> It makes no sense on real data. Even if some names are the same, the
> factors are not the same!
>
>
> The only real-data application that springs to mind is censoring (right
> or left, depending on the choice): but here we have both open and closed
> intervals, e.g. to the right (in the same data-set).
>
>
> Leonard
>
>
> On 9/18/2021 1:29 AM, Andrew Simmons wrote:
> > I disagree; I don't really think it's too long or ugly, but if you
> > think it is, you could abbreviate it as 'i'.
> >
> >
> > x <- 0:20
> > breaks1 <- seq.int(0, 16, 4)
> > breaks2 <- seq.int(0, 20, 4)
> > data.frame(
> >     cut(x, breaks1, right = FALSE, i = TRUE),
> >     cut(x, breaks2, right = FALSE, i = TRUE),
> >     check.names = FALSE
> > )
> >
> >
> > I hope this helps.
> >
> > On Fri, Sep 17, 2021 at 6:26 PM Leonard Mada <leo.mada using syonic.eu>
> > wrote:
> >
> >     Hello Andrew,
> >
> >
> >     But "cut" generates factors. With real data one usually expects the
> >     ends of the interval to be included as well: the argument
> >     "include.lowest" is both ugly and too long.
> >
> >     [The test-code on the ftable thread contains this error! I have
> >     run into this error a couple of times.]
> >
> >
> >     The only real situation that I can imagine to be problematic:
> >
> >     - if the interval goes to +Inf (or -Inf): I do not know if there
> >     would be any effects when including +Inf (or -Inf).
> >
> >
> >     Leonard
> >
> >
> >     On 9/18/2021 1:14 AM, Andrew Simmons wrote:
> >>     While it is not explicitly mentioned anywhere in the
> >>     documentation for .bincode, I suspect 'include.lowest = FALSE' is
> >>     the default to keep the definitions of the bins consistent. For
> >>     example:
> >>
> >>
> >>     x <- 0:20
> >>     breaks1 <- seq.int(0, 16, 4)
> >>     breaks2 <- seq.int(0, 20, 4)
> >>     cbind(
> >>         .bincode(x, breaks1, right = FALSE, include.lowest = TRUE),
> >>         .bincode(x, breaks2, right = FALSE, include.lowest = TRUE)
> >>     )
> >>
> >>
> >>     By having 'include.lowest = TRUE' with different ends, you can
> >>     get inconsistent behaviour. While this probably wouldn't be an
> >>     issue with 'real' data, this would seem like something you'd want
> >>     to avoid by default. The definitions of the bins are
> >>
> >>
> >>     [0, 4)
> >>     [4, 8)
> >>     [8, 12)
> >>     [12, 16]
> >>
> >>
> >>     and
> >>
> >>
> >>     [0, 4)
> >>     [4, 8)
> >>     [8, 12)
> >>     [12, 16)
> >>     [16, 20]
> >>
> >>
> >>     so you can see where the inconsistent behaviour comes from. You
> >>     might be able to get R-core to add argument 'warn', but probably
> >>     not to change the default of 'include.lowest'. I hope this helps.
> >>
> >>
> >>     On Fri, Sep 17, 2021 at 6:01 PM Leonard Mada <leo.mada using syonic.eu>
> >>     wrote:
> >>
> >>         Thank you Andrew.
> >>
> >>
> >>         Is there any reason not to make include.lowest = TRUE the
> >>         default?
> >>
> >>
> >>         Regarding the NA:
> >>
> >>         The user still has to suspect that some values were not
> >>         included and run that test.
> >>
> >>
> >>         Leonard
> >>
> >>
> >>         On 9/18/2021 12:53 AM, Andrew Simmons wrote:
> >>>         Regarding your first point, argument 'include.lowest'
> >>>         already handles this specific case, see ?.bincode
> >>>
> >>>         Your second point, maybe it could be helpful, but since both
> >>>         'cut.default' and '.bincode' return NA if a value isn't
> >>>         within a bin, you could make something like this on your own.
> >>>         Might be worth pitching to R-bugs on the wishlist.
> >>>
> >>>
> >>>
> >>>         On Fri, Sep 17, 2021, 17:45 Leonard Mada via R-help
> >>>         <r-help using r-project.org> wrote:
> >>>
> >>>             Hello List members,
> >>>
> >>>
> >>>             the following improvements would be useful for function
> >>>             cut (and .bincode):
> >>>
> >>>
> >>>             1.) Argument: Include extremes
> >>>             extremes = TRUE
> >>>             if(right == FALSE) {
> >>>                 # include also right for last interval;
> >>>             } else {
> >>>                 # include also left for first interval;
> >>>             }
> >>>
> >>>
> >>>             2.) Argument: warn = TRUE
> >>>
> >>>             Warn if any values are not included in the intervals.
> >>>
> >>>
> >>>             Motivation:
> >>>             - reduce risk of errors when using function cut();
> >>>
> >>>
> >>>             Sincerely,
> >>>
> >>>
> >>>             Leonard
> >>>
> >>>             ______________________________________________
> >>>             R-help using r-project.org
> >>>             mailing list -- To UNSUBSCRIBE and more, see
> >>>             https://stat.ethz.ch/mailman/listinfo/r-help
> >>>             PLEASE do read the posting guide
> >>>             http://www.R-project.org/posting-guide.html
> >>>             and provide commented, minimal, self-contained,
> >>>             reproducible code.
> >>>
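
For reference, the inconsistency Andrew Simmons describes and the 'warn' behaviour Leonard Mada proposes can both be reproduced in a few lines of R. This is only a minimal sketch: cut_warn is a hypothetical helper name, not part of base R, and the wording of its warning is an assumption.

```r
x <- 0:20
breaks1 <- seq.int(0, 16, 4)
breaks2 <- seq.int(0, 20, 4)

# With right = FALSE and include.lowest = TRUE, only the LAST interval is
# closed on the right, so the bin that 16 lands in depends on where the
# breaks end.
code1 <- .bincode(x, breaks1, right = FALSE, include.lowest = TRUE)
code2 <- .bincode(x, breaks2, right = FALSE, include.lowest = TRUE)
cbind(x, code1, code2)[x >= 15, ]

# Hypothetical helper (not in base R): cut() plus a warning whenever a
# non-missing value falls outside every interval and is turned into NA.
cut_warn <- function(x, breaks, ...) {
  res <- cut(x, breaks, ...)
  dropped <- is.na(res) & !is.na(x)
  if (any(dropped)) {
    warning(sum(dropped), " value(s) fall outside the intervals",
            call. = FALSE)
  }
  res
}

# Values 17:20 lie beyond the last break of breaks1 and trigger the warning.
f <- suppressWarnings(cut_warn(x, breaks1, right = FALSE, include.lowest = TRUE))
table(f, useNA = "ifany")
```

The wrapper leaves cut() itself untouched, which matches the suggestion in the thread that such a check can be built on the NA values that cut.default and .bincode already return.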


