[R] Improvement: function cut

David Winsemius
Sat Sep 18 22:43:31 CEST 2021


On 9/18/21 5:28 AM, Leonard Mada via R-help wrote:
> Hello Andrew,
>
>
> I add this info as a completion (so other users can get a better
> understanding):
>
> If we want to perform a survival analysis, then the intervals should be
> closed on the right, but we should also include the first time point (as
> per Intention-to-Treat):
>
> [0, 4](4, 8](8, 12](12, 16]
>
> [0, 4](4, 8](8, 12](12, 16](16, 20]
>
>
> So the series is extendible to the right without any errors!
>
> But the 1st interval (which is the same in both series) is different
> from the other intervals: [0, 4].
>
>
> I feel that this should have been the default behaviour for cut().
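
For reference, both series quoted above can already be produced with the existing arguments of cut(). A minimal sketch, assuming integer time points 0..20 as example data (the data and the names f1 and f2 are illustrative and not from the original post):

```r
# Assumed example data: integer time points 0..20.
x <- 0:20
breaks1 <- seq(0, 16, by = 4)
breaks2 <- seq(0, 20, by = 4)

# right = TRUE is the default of cut(); include.lowest = TRUE additionally
# closes the first interval on the left, so that 0 is not turned into NA.
f1 <- cut(x, breaks1, include.lowest = TRUE)
f2 <- cut(x, breaks2, include.lowest = TRUE)

levels(f1)
levels(f2)

# Values beyond the last break of breaks1 (17..20) become NA in f1.
table(f1, useNA = "ifany")
```

The first four levels agree between the two series, so extending the breaks to the right does not change how the earlier values are binned.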

To Leonard;

If you do not like the behavior of `cut`, then you should "roll your 
own". It's very unlikely that R Core will modify a base function like 
cut. You might want to look at Hmisc::cut2. Frank Harrell didn't like 
that default behavior and thought he could make a better cut, so he just 
put it in his package. I did like his version better and often used it 
when I was actively programming. I suspect there is also a tidyverse 
cut-like function, but I'm not terribly familiar with that fork of R. 
(It's really not the same language IMHO.)
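
A "roll your own" version could look something like the sketch below. The name cut_closed and its warn argument are hypothetical (they exist neither in base R nor in Hmisc); the wrapper only changes the default of include.lowest and adds the warning that was asked for earlier in the thread.

```r
# Hypothetical helper: `cut_closed` and `warn` are illustrative names,
# not part of base R or of any package.
cut_closed <- function(x, breaks, right = TRUE, warn = TRUE, ...) {
  # Always close the outermost interval, whichever side is open.
  out <- cut(x, breaks, right = right, include.lowest = TRUE, ...)
  # Count inputs that were not missing but fell outside every interval.
  n_dropped <- sum(is.na(out) & !is.na(x))
  if (warn && n_dropped > 0) {
    warning(n_dropped, " value(s) fall outside the breaks and were set to NA")
  }
  out
}

f <- cut_closed(0:16, seq(0, 16, by = 4))
levels(f)
```

Because include.lowest is fixed inside the wrapper, passing it again through ... would be an error; anyone who wants it switchable would expose it as a regular argument instead.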

But it's a waste of time and energy to try to propose modifications of core 
R functions unless *you* can show that the change is stable across 20,000 
packages and will not offend long-time users. The likelihood of that 
happening for your proposal is vanishingly small in my estimation. You 
shouldn't ask R Core to do that for you. They are busy fixing real bugs.


If you want to persist despite my negativity, then you should make a 
complete proposal by submitting a proper diff file that incorporates 
your tested efforts to the R-devel mailing list.


-- 

David

>
> Note:
>
> I was induced to think about a different situation in my previous
> message, as you constructed open intervals on the right, and also
> extended to the right. But survival analysis should be as described in
> this mail and should probably be the default.
>
>
> Sincerely,
>
>
> Leonard
>
>
> On 9/18/2021 1:29 AM, Andrew Simmons wrote:
>> I disagree, I don't really think it's too long or ugly, but if you
>> think it is, you could abbreviate it as 'i'.
>>
>>
>> x <- 0:20
>> breaks1 <- seq.int(0, 16, 4)
>> breaks2 <- seq.int(0, 20, 4)
>> data.frame(
>>      cut(x, breaks1, right = FALSE, i = TRUE),
>>      cut(x, breaks2, right = FALSE, i = TRUE),
>>      check.names = FALSE
>> )
>>
>>
>> I hope this helps.
>>
>> On Fri, Sep 17, 2021 at 6:26 PM Leonard Mada <leo.mada using syonic.eu> wrote:
>>
>>      Hello Andrew,
>>
>>
>>      But "cut" generates factors. In most cases with real data one
>>      expects to have also the ends of the interval: the argument
>>      "include.lowest" is both ugly and too long.
>>
>>      [The test code on the ftable thread contains this error! I have
>>      run into this error a couple of times.]
>>
>>
>>      The only real situation that I can imagine to be problematic:
>>
>>      - if the interval goes to +Inf (or -Inf): I do not know if there
>>      would be any effects when including +Inf (or -Inf).
>>
>>
>>      Leonard
>>
>>
>>      On 9/18/2021 1:14 AM, Andrew Simmons wrote:
>>>      While it is not explicitly mentioned anywhere in the
>>>      documentation for .bincode, I suspect 'include.lowest = FALSE' is
>>>      the default to keep the definitions of the bins consistent. For
>>>      example:
>>>
>>>
>>>      x <- 0:20
>>>      breaks1 <- seq.int(0, 16, 4)
>>>      breaks2 <- seq.int(0, 20, 4)
>>>      cbind(
>>>          .bincode(x, breaks1, right = FALSE, include.lowest = TRUE),
>>>          .bincode(x, breaks2, right = FALSE, include.lowest = TRUE)
>>>      )
>>>
>>>
>>>      by having 'include.lowest = TRUE' with different ends, you can
>>>      get inconsistent behaviour. While this probably wouldn't be an
>>>      issue with 'real' data, this would seem like something you'd want
>>>      to avoid by default. The definitions of the bins are
>>>
>>>
>>>      [0, 4)
>>>      [4, 8)
>>>      [8, 12)
>>>      [12, 16]
>>>
>>>
>>>      and
>>>
>>>
>>>      [0, 4)
>>>      [4, 8)
>>>      [8, 12)
>>>      [12, 16)
>>>      [16, 20]
>>>
>>>
>>>      so you can see where the inconsistent behaviour comes from. You
>>>      might be able to get R-core to add argument 'warn', but probably
>>>      not to change the default of 'include.lowest'. I hope this helps
>>>
>>>
>>>      On Fri, Sep 17, 2021 at 6:01 PM Leonard Mada <leo.mada using syonic.eu> wrote:
>>>
>>>          Thank you Andrew.
>>>
>>>
>>>          Is there any reason not to make: include.lowest = TRUE the
>>>          default?
>>>
>>>
>>>          Regarding the NA:
>>>
>>>          The user still has to suspect that some values were not
>>>          included and run that test.
>>>
>>>
>>>          Leonard
>>>
>>>
>>>          On 9/18/2021 12:53 AM, Andrew Simmons wrote:
>>>>          Regarding your first point, argument 'include.lowest'
>>>>          already handles this specific case, see ?.bincode
>>>>
>>>>          Your second point, maybe it could be helpful, but since both
>>>>          'cut.default' and '.bincode' return NA if a value isn't
>>>>          within a bin, you could make something like this on your own.
>>>>          Might be worth pitching to R-bugs on the wishlist.
>>>>
>>>>
>>>>
>>>>          On Fri, Sep 17, 2021, 17:45 Leonard Mada via R-help
>>>>          <r-help using r-project.org> wrote:
>>>>
>>>>              Hello List members,
>>>>
>>>>
>>>>              the following improvements would be useful for function
>>>>              cut (and .bincode):
>>>>
>>>>
>>>>              1.) Argument: Include extremes
>>>>              extremes = TRUE
>>>>              if(right == FALSE) {
>>>>                  # include also right for last interval;
>>>>              } else {
>>>>                  # include also left for first interval;
>>>>              }
>>>>
>>>>
>>>>              2.) Argument: warn = TRUE
>>>>
>>>>              Warn if any values are not included in the intervals.
>>>>
>>>>
>>>>              Motivation:
>>>>              - reduce risk of errors when using function cut();
>>>>
>>>>
>>>>              Sincerely,
>>>>
>>>>
>>>>              Leonard
>>>>
>>>>              ______________________________________________
>>>>              R-help using r-project.org
>>>>              mailing list -- To UNSUBSCRIBE and more, see
>>>>              https://stat.ethz.ch/mailman/listinfo/r-help
>>>>              PLEASE do read the posting guide
>>>>              http://www.R-project.org/posting-guide.html
>>>>              and provide commented, minimal, self-contained,
>>>>              reproducible code.
>>>>