[Rd] the pipe |> and line breaks in pipelines

Wed Dec 9 21:12:48 CET 2020

On 09/12/2020 2:33 p.m., Timothy Goodman wrote:
> If I type my_data_frame_1 and press Enter (or Ctrl+Enter to execute the 
> command in the Notebook environment I'm using) I certainly *would* 
> expect R to treat it as a complete statement.
> 
> But what I'm talking about is a different case, where I highlight a 
> multi-line statement in my notebook:
> 
>      my_data_frame1
>          |> filter(some_conditions_1)
> 
> and then press Ctrl+Enter.

I don't think I'd like it if parsing changed between passing one line at 
a time and passing a block of lines.  I'd like to be able to highlight a 
few lines and pass those, then type one, then highlight some more and 
pass those:  and have it act as though I just passed the whole combined 
block, or typed everything one line at a time.
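
A minimal sketch of what the parser does today when a pipeline is handed over as one block of text, assuming R 4.1.0 or later (the first release with the native pipe). The variable names are illustrative only, and parse(text = ...) stands in for whatever the notebook front end actually sends:

```r
# Requires R >= 4.1.0 for the native pipe.

# Pipe at the end of the line: the first line is incomplete, so the parser
# keeps reading and the two lines form a single expression.
trailing <- parse(text = "x |>\n  sqrt()")
length(trailing)

# Pipe at the start of the next line: "x" is already a complete expression,
# so the second line starts with a stray pipe and parsing fails.
leading <- tryCatch(
  parse(text = "x\n  |> sqrt()"),
  error = function(e) conditionMessage(e)
)
is.character(leading)  # an error message was captured instead of a result
```

The failure is the same whether the two lines arrive together or one at a time, which is the consistency described above.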

> Or, I suppose the equivalent would be to run
> an R script containing those two lines of code, or to run a multi-line 
> statement like that from the console (which in RStudio I can do by 
> pressing Shift+Enter between the lines.)
> 
> In those cases, R could either (1) Give an error message [the current 
> behavior], or (2) understand that the first line is meant to be piped to 
> the second.  The second option would be significantly more useful, and 
> is almost certainly what the user intended.
> 
> (For what it's worth, there are some languages, such as JavaScript, that 
> consider the first token of the next line when determining if the 
> previous line was complete.  JavaScript's rules around this are overly 
> complicated, but a rule like "a pipe following a line break is treated 
> as continuing the previous line" would be much simpler.  And while it 
> might be objectionable to treat the operator %>% differently from other 
> operators, the addition of |>, which isn't truly an operator at all, 
> seems like the right time to consider it.)

I think this would be hard to implement with R's current parser, but 
possible.  I think it could be done by distinguishing between EOL 
markers within a block of text and "end of block" marks.  If it applied 
only to the |> operator it would be *really* ugly.
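
To make the block-level idea concrete, here is a rough sketch done as text preprocessing rather than as a parser change. The helper parse_block() is hypothetical and not part of R. It treats a newline followed by a pipe as a continuation only because it sees the whole block at once. It assumes R 4.1.0 or later and deliberately ignores comments and string literals:

```r
# Hypothetical helper, not part of R: parse a block of text in which a line
# that starts with a pipe continues the previous line.
parse_block <- function(code) {
  # Join "newline + optional whitespace + pipe" onto the preceding line.
  # Naive on purpose: it does not skip comments or string literals.
  joined <- gsub("[ \t]*\n[ \t]*(\\|>|%>%)", " \\1", code)
  parse(text = joined)
}

block <- "x\n  |> sqrt()\n  |> round(1)"
exprs <- parse_block(block)
length(exprs)                      # the three lines form one expression
eval(exprs[[1]], list(x = 16))     # evaluates round(sqrt(16), 1)
```

Feeding the same three lines to the console one at a time would still stop after the first line, so this kind of rewrite only helps when a whole block is submitted at once.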

My strongest objection to it is the one at the top, though.  If I have a 
block of lines sitting in my editor that I just finished executing, with 
the cursor pointing at the next line, I'd like to know that it didn't 
matter whether the lines were passed one at a time, as a block, or some 
combination of those.

Duncan Murdoch

> 
> -Tim
> 
> On Wed, Dec 9, 2020 at 3:12 AM Duncan Murdoch <murdoch.duncan using gmail.com>
> wrote:
> 
>     The requirement for operators at the end of the line comes from the
>     interactive nature of R.  If you type
> 
>           my_data_frame_1
> 
>     how could R know that you are not done, and are planning to type the
>     rest of the expression
> 
>             %>% filter(some_conditions_1)
>             ...
> 
>     before it should consider the expression complete?  The way languages
>     like C do this is by requiring a statement terminator at the end.  You
>     can also do it by wrapping the entire thing in parentheses ().
> 
>     However, be careful: don't use braces; they don't work.  And parens
>     have the side effect of removing invisibility from the result (which is
>     a design flaw or bonus, depending on your point of view).  So I actually
>     wouldn't advise this workaround.
> 
>     Duncan Murdoch
> 
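
A short sketch of the parenthesis workaround quoted above and its effect on visibility, assuming R 4.1.0 or later. The variable names are illustrative only:

```r
# Requires R >= 4.1.0 for the native pipe.

# Inside parentheses the parser skips newlines, so a leading pipe is accepted.
in_parens <- parse(text = "(x\n  |> sqrt())")
length(in_parens)

# Inside braces a newline can end an expression, so the same layout fails.
in_braces <- tryCatch(
  parse(text = "{x\n  |> sqrt()}"),
  error = function(e) "parse error"
)

# Parentheses also make an invisible result visible again.
bare_visible    <- withVisible(invisible(1))$visible
wrapped_visible <- withVisible((invisible(1)))$visible
```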
> 
>     On 09/12/2020 12:45 a.m., Timothy Goodman wrote:
>      > Hi,
>      >
>      > I'm a data scientist who routinely uses R in my day-to-day work, for tasks
>      > such as cleaning and transforming data, exploratory data analysis, etc.
>      > This includes frequent use of the pipe operator from the magrittr and dplyr
>      > libraries, %>%.  So, I was pleased to hear about the recent work on a
>      > native pipe operator, |>.
>      >
>      > This seems like a good time to bring up the main pain point I encounter
>      > when using pipes in R, and some suggestions on what could be done about
>      > it.  The issue is that the pipe operator can't be placed at the start of a
>      > line of code (except in parentheses).  That's no different than any binary
>      > operator in R, but I find it's a source of difficulty for the pipe because
>      > of how pipes are often used.
>      >
>      > [I'm assuming here that my usage is fairly typical of a lot of users; at
>      > any rate, I don't think I'm *too* unusual.]
>      >
>      > === Why this is a problem ===
>      >
>      > It's very common (for me, and I suspect for many users of dplyr) to write
>      > multi-step pipelines and put each step on its own line for readability.
>      > Something like this:
>      >
>      >    ### Example 1 ###
>      >    my_data_frame_1 %>%
>      >      filter(some_conditions_1) %>%
>      >      inner_join(my_data_frame_2, by = some_columns_1) %>%
>      >      group_by(some_columns_2) %>%
>      >      summarize(some_aggregate_functions_1) %>%
>      >      filter(some_conditions_2) %>%
>      >      left_join(my_data_frame_3, by = some_columns_3) %>%
>      >      group_by(some_columns_4) %>%
>      >      summarize(some_aggregate_functions_2) %>%
>      >      arrange(some_columns_5)
>      >
>      > [I guess some might consider this an overly long pipeline; for me it's
>      > pretty typical.  I *could* split it up by assigning intermediate results to
>      > variables, but much of the value I get from the pipe is that it lets my
>      > code communicate which results are temporary, and which will be used again
>      > later.  Assigning variables for single-use results would remove that
>      > expressiveness.]
>      >
>      > I would prefer (for reasons I'll explain) to be able to write the above
>      > example like this, which isn't valid R:
>      >
>      >    ### Example 2 (not valid R) ###
>      >    my_data_frame_1
>      >      %>% filter(some_conditions_1)
>      >      %>% inner_join(my_data_frame_2, by = some_columns_1)
>      >      %>% group_by(some_columns_2)
>      >      %>% summarize(some_aggregate_functions_1)
>      >      %>% filter(some_conditions_2)
>      >      %>% left_join(my_data_frame_3, by = some_columns_3)
>      >      %>% group_by(some_columns_4)
>      >      %>% summarize(some_aggregate_functions_2)
>      >      %>% arrange(some_columns_5)
>      >
>      > One (minor) advantage is obvious: It lets you easily line up the pipes,
>      > which means that you can see at a glance that the whole block is a single
>      > pipeline, and you'd immediately notice if you inadvertently omitted a pipe,
>      > which otherwise can lead to confusing output.  [It's also aesthetically
>      > pleasing, especially when %>% is replaced with |>, but that's subjective.]
>      >
>      > But the bigger issue happens when I want to re-run just *part* of the
>      > pipeline.  I do this often when debugging: if the output of the pipeline
>      > seems wrong, I re-run the first few steps and check the output, then
>      > include a little more and re-run again, etc., until I locate my mistake.
>      > Working in an interactive notebook environment, this involves using the
>      > cursor to select just the part of the code I want to re-run.
>      >
>      > It's fast and easy to select *entire* lines of code, but unfortunately with
>      > the pipes placed at the end of the line I must instead select everything
>      > *except* the last three characters of the line (the last two characters for
>      > the new pipe).  Then when I want to re-run the same partial pipeline with
>      > the next line of code included, I can't just press SHIFT+Down to select it
>      > as I otherwise would, but instead must move the cursor horizontally to a
>      > position three characters before the end of *that* line (which is generally
>      > different due to varying line lengths).  And so forth each time I want to
>      > include an additional line.
>      >
>      > Moreover, with the staggered positions of the pipes at the end of each
>      > line, it's very easy to accidentally select the final pipe on a line, and
>      > then sit there for a moment wondering if the environment has stopped
>      > responding before realizing it's just waiting for further input (i.e., for
>      > the right-hand side).  These small delays and disruptions add up over the
>      > course of a day.
>      >
>      > This desire to select and re-run the first part of a pipeline is also the
>      > reason why it doesn't suffice to achieve syntax like my "Example 2" by
>      > wrapping the entire pipeline in parentheses.  That's of no use if I want to
>      > re-run a selection that doesn't include the final close-paren.
>      >
>      > === Possible Solutions ===
>      >
>      > I can think of two, but maybe there are others.  The first would make
>      > "Example 2" into valid code, and the second would allow you to run a
>      > selection that included a trailing pipe.
>      >
>      >    Solution 1: Add a special case to how R is parsed, so if the first
>      > (non-whitespace) token after an end-line is a pipe, that pipe gets moved to
>      > before the end-line.
>      >      - Argument for: This lets you write code like example 2, which
>      > addresses the pain point around re-running part of a pipeline, and has
>      > advantages for readability.  Also, since starting a line with a pipe
>      > operator is currently invalid, the change wouldn't break any working code.
>      >      - Argument against: It would make the behavior of %>% inconsistent with
>      > that of other binary operators in R.  (However, this objection might not
>      > apply to the new pipe, |>, which I understand is being implemented as a
>      > syntax transformation rather than a binary operator.)
>      >
>      >    Solution 2: Ignore the pipe operator if it occurs as the final token of
>      > the code being executed.
>      >      - Argument for: This would mean the user could select and re-run the
>      > first few lines of a longer pipeline (selecting *entire* lines), avoiding
>      > the difficulties described above.
>      >      - Argument against: This means that %>% would be valid even if it
>      > occurred without a right-hand side, which is inconsistent with other
>      > operators in R.  (But, as above, this objection might not apply to |>.)
>      > Also, this solution still doesn't enable the syntax of "Example 2", with
>      > its readability benefit.
>      >
>      > Thanks for reading this and considering it.
>      >
>      > - Tim Goodman
>      >
>      > ______________________________________________
>      > R-devel using r-project.org mailing list
>      > https://stat.ethz.ch/mailman/listinfo/r-devel
>      >
>