[Rd] the pipe |> and line breaks in pipelines

Timothy Goodman
Wed Dec 9 21:45:51 CET 2020


Regarding special treatment for |>, isn't it getting special treatment
anyway, because it's implemented as a syntax transformation from x |> f(y)
to f(x, y), rather than as an operator?
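
The transformation can be seen from the parser itself.  A minimal check,
assuming a version of R that includes |> (x, f and y are placeholder
symbols and are never evaluated):

```r
# The parser rewrites the pipe before evaluation, so the quoted
# expression is already an ordinary call with no pipe left in it.
piped  <- quote(x |> f(y))
direct <- quote(f(x, y))

same_call    <- identical(piped, direct)
head_of_call <- as.character(piped[[1]])

print(piped)
print(same_call)
```

So by the time anything is evaluated, there is no pipe "operator" left
to dispatch on, only the call to f.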

That said, the point about wanting a block of code submitted line-by-line
to work the same as a block of code submitted all at once is a fair one.
Maybe the better solution would be if there were a way to say "Submit the
selected code as a single expression, ignoring line-breaks".  Then I could
run any number of lines with pipes at the start and no special character at
the end, and have it treated as a single pipeline.  I suppose that'd need
to be a feature offered by the environment (RStudio's R Notebooks in my
case).  I could wrap my pipelines in parentheses (to make the "pipes at
start of line" syntax valid R code), and then could use the hypothetical
"submit selected code ignoring line-breaks" feature when running just the
first part of the pipeline -- i.e., selecting full lines, but starting
after the opening paren so as not to need to insert a closing paren.
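
As a rough sketch of what such a feature might do, assuming a version of
R that includes |>: run_ignoring_breaks below is a hypothetical helper
(not an existing R or RStudio function) that joins the selected lines
with spaces before parsing.  It would misbehave on lines containing #
comments, and it refuses selections holding more than one statement.
Base functions stand in for the dplyr verbs so it runs on its own.

```r
# Hypothetical helper: treat the selected lines as one expression by
# replacing the line breaks with spaces before parsing.
run_ignoring_breaks <- function(lines, envir = parent.frame()) {
  code  <- paste(lines, collapse = " ")
  exprs <- parse(text = code)
  stopifnot(length(exprs) == 1L)  # sketch handles a single expression only
  eval(exprs[[1]], envir = envir)
}

selected <- c(
  "c(3, 1, 2)",
  "  |> sort()",
  "  |> rev()"
)

# Parsed as written, the second line starts with a pipe: a syntax error.
as_written <- tryCatch(
  parse(text = paste(selected, collapse = "\n")),
  error = function(e) "syntax error"
)

# Wrapped in parentheses, the same text is valid R.
in_parens <- eval(parse(
  text = paste0("(", paste(selected, collapse = "\n"), ")")
))

# With the helper, whole lines can be selected without a closing paren:
partial <- run_ignoring_breaks(selected[1:2])  # first step of the pipeline
full    <- run_ignoring_breaks(selected)       # whole pipeline
```

The partial run is the case I care about: whole lines selected, no
trailing pipe to trim and no closing paren to add.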

- Tim

On Wed, Dec 9, 2020 at 12:12 PM Duncan Murdoch <murdoch.duncan using gmail.com>
wrote:

> On 09/12/2020 2:33 p.m., Timothy Goodman wrote:
> > If I type my_data_frame_1 and press Enter (or Ctrl+Enter to execute the
> > command in the Notebook environment I'm using) I certainly *would*
> > expect R to treat it as a complete statement.
> >
> > But what I'm talking about is a different case, where I highlight a
> > multi-line statement in my notebook:
> >
> >      my_data_frame1
> >          |> filter(some_conditions_1)
> >
> > and then press Ctrl+Enter.
>
> I don't think I'd like it if parsing changed between passing one line at
> a time and passing a block of lines.  I'd like to be able to highlight a
> few lines and pass those, then type one, then highlight some more and
> pass those:  and have it act as though I just passed the whole combined
> block, or typed everything one line at a time.
>
>
> > Or, I suppose the equivalent would be to run
> > an R script containing those two lines of code, or to run a multi-line
> > statement like that from the console (which in RStudio I can do by
> > pressing Shift+Enter between the lines.)
> >
> > In those cases, R could either (1) Give an error message [the current
> > behavior], or (2) understand that the first line is meant to be piped to
> > the second.  The second option would be significantly more useful, and
> > is almost certainly what the user intended.
> >
> > (For what it's worth, there are some languages, such as JavaScript, that
> > consider the first token of the next line when determining if the
> > previous line was complete.  JavaScript's rules around this are overly
> > complicated, but a rule like "a pipe following a line break is treated
> > as continuing the previous line" would be much simpler.  And while it
> > might be objectionable to treat the operator %>% differently from other
> > operators, the addition of |>, which isn't truly an operator at all,
> > seems like the right time to consider it.)
>
> I think this would be hard to implement with R's current parser, but
> possible.  I think it could be done by distinguishing between EOL
> markers within a block of text and "end of block" marks.  If it applied
> only to the |> operator it would be *really* ugly.
>
> My strongest objection to it is the one at the top, though.  If I have a
> block of lines sitting in my editor that I just finished executing, with
> the cursor pointing at the next line, I'd like to know that it didn't
> matter whether the lines were passed one at a time, as a block, or some
> combination of those.
>
> Duncan Murdoch
>
> >
> > -Tim
> >
> > On Wed, Dec 9, 2020 at 3:12 AM Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
> >
> >     The requirement for operators at the end of the line comes from the
> >     interactive nature of R.  If you type
> >
> >           my_data_frame_1
> >
> >     how could R know that you are not done, and are planning to type the
> >     rest of the expression
> >
> >             %>% filter(some_conditions_1)
> >             ...
> >
> >     before it should consider the expression complete?  The way languages
> >     like C do this is by requiring a statement terminator at the end.
> >     You can also do it by wrapping the entire thing in parentheses ().
> >
> >     However, be careful: Don't use braces:  they don't work.  And parens
> >     have the side effect of removing invisibility from the result (which
> >     is a design flaw or bonus, depending on your point of view).  So I
> >     actually wouldn't advise this workaround.
> >
> >     Duncan Murdoch
> >
> >
> >     On 09/12/2020 12:45 a.m., Timothy Goodman wrote:
> >      > Hi,
> >      >
> >      > I'm a data scientist who routinely uses R in my day-to-day work, for
> >      > tasks such as cleaning and transforming data, exploratory data
> >      > analysis, etc.  This includes frequent use of the pipe operator from
> >      > the magrittr and dplyr libraries, %>%.  So, I was pleased to hear
> >      > about the recent work on a native pipe operator, |>.
> >      >
> >      > This seems like a good time to bring up the main pain point I
> >      > encounter when using pipes in R, and some suggestions on what could
> >      > be done about it.  The issue is that the pipe operator can't be
> >      > placed at the start of a line of code (except in parentheses).
> >      > That's no different than any binary operator in R, but I find it's a
> >      > source of difficulty for the pipe because of how pipes are often
> >      > used.
> >      >
> >      > [I'm assuming here that my usage is fairly typical of a lot of
> >      > users; at any rate, I don't think I'm *too* unusual.]
> >      >
> >      > === Why this is a problem ===
> >      >
> >      > It's very common (for me, and I suspect for many users of dplyr) to
> >      > write multi-step pipelines and put each step on its own line for
> >      > readability.  Something like this:
> >      >
> >      >    ### Example 1 ###
> >      >    my_data_frame_1 %>%
> >      >      filter(some_conditions_1) %>%
> >      >      inner_join(my_data_frame_2, by = some_columns_1) %>%
> >      >      group_by(some_columns_2) %>%
> >      >      summarize(some_aggregate_functions_1) %>%
> >      >      filter(some_conditions_2) %>%
> >      >      left_join(my_data_frame_3, by = some_columns_3) %>%
> >      >      group_by(some_columns_4) %>%
> >      >      summarize(some_aggregate_functions_2) %>%
> >      >      arrange(some_columns_5)
> >      >
> >      > [I guess some might consider this an overly long pipeline; for me
> >      > it's pretty typical.  I *could* split it up by assigning
> >      > intermediate results to variables, but much of the value I get from
> >      > the pipe is that it lets my code communicate which results are
> >      > temporary, and which will be used again later.  Assigning variables
> >      > for single-use results would remove that expressiveness.]
> >      >
> >      > I would prefer (for reasons I'll explain) to be able to write the
> >      > above example like this, which isn't valid R:
> >      >
> >      >    ### Example 2 (not valid R) ###
> >      >    my_data_frame_1
> >      >      %>% filter(some_conditions_1)
> >      >      %>% inner_join(my_data_frame_2, by = some_columns_1)
> >      >      %>% group_by(some_columns_2)
> >      >      %>% summarize(some_aggregate_functions_1)
> >      >      %>% filter(some_conditions_2)
> >      >      %>% left_join(my_data_frame_3, by = some_columns_3)
> >      >      %>% group_by(some_columns_4)
> >      >      %>% summarize(some_aggregate_functions_2)
> >      >      %>% arrange(some_columns_5)
> >      >
> >      > One (minor) advantage is obvious: It lets you easily line up the
> >      > pipes, which means that you can see at a glance that the whole block
> >      > is a single pipeline, and you'd immediately notice if you
> >      > inadvertently omitted a pipe, which otherwise can lead to confusing
> >      > output.  [It's also aesthetically pleasing, especially when %>% is
> >      > replaced with |>, but that's subjective.]
> >      >
> >      > But the bigger issue happens when I want to re-run just *part* of
> >      > the pipeline.  I do this often when debugging: if the output of the
> >      > pipeline seems wrong, I re-run the first few steps and check the
> >      > output, then include a little more and re-run again, etc., until I
> >      > locate my mistake.  Working in an interactive notebook environment,
> >      > this involves using the cursor to select just the part of the code I
> >      > want to re-run.
> >      >
> >      > It's fast and easy to select *entire* lines of code, but
> >      > unfortunately with the pipes placed at the end of the line I must
> >      > instead select everything *except* the last three characters of the
> >      > line (the last two characters for the new pipe).  Then when I want
> >      > to re-run the same partial pipeline with the next line of code
> >      > included, I can't just press SHIFT+Down to select it as I otherwise
> >      > would, but instead must move the cursor horizontally to a position
> >      > three characters before the end of *that* line (which is generally
> >      > different due to varying line lengths).  And so forth each time I
> >      > want to include an additional line.
> >      >
> >      > Moreover, with the staggered positions of the pipes at the end of
> >      > each line, it's very easy to accidentally select the final pipe on a
> >      > line, and then sit there for a moment wondering if the environment
> >      > has stopped responding before realizing it's just waiting for
> >      > further input (i.e., for the right-hand side).  These small delays
> >      > and disruptions add up over the course of a day.
> >      >
> >      > This desire to select and re-run the first part of a pipeline is
> >      > also the reason why it doesn't suffice to achieve syntax like my
> >      > "Example 2" by wrapping the entire pipeline in parentheses.  That's
> >      > of no use if I want to re-run a selection that doesn't include the
> >      > final close-paren.
> >      >
> >      > === Possible Solutions ===
> >      >
> >      > I can think of two, but maybe there are others.  The first would
> >      > make "Example 2" into valid code, and the second would allow you to
> >      > run a selection that included a trailing pipe.
> >      >
> >      >    Solution 1: Add a special case to how R is parsed, so if the first
> >      > (non-whitespace) token after an end-line is a pipe, that pipe gets
> >      > moved to before the end-line.
> >      >      - Argument for: This lets you write code like Example 2, which
> >      > addresses the pain point around re-running part of a pipeline, and
> >      > has advantages for readability.  Also, since starting a line with a
> >      > pipe operator is currently invalid, the change wouldn't break any
> >      > working code.
> >      >      - Argument against: It would make the behavior of %>%
> >      > inconsistent with that of other binary operators in R.  (However,
> >      > this objection might not apply to the new pipe, |>, which I
> >      > understand is being implemented as a syntax transformation rather
> >      > than a binary operator.)
> >      >
> >      >    Solution 2: Ignore the pipe operator if it occurs as the final
> >      > token of the code being executed.
> >      >      - Argument for: This would mean the user could select and re-run
> >      > the first few lines of a longer pipeline (selecting *entire* lines),
> >      > avoiding the difficulties described above.
> >      >      - Argument against: This means that %>% would be valid even if
> >      > it occurred without a right-hand side, which is inconsistent with
> >      > other operators in R.  (But, as above, this objection might not
> >      > apply to |>.)  Also, this solution still doesn't enable the syntax
> >      > of "Example 2", with its readability benefit.
> >      >
> >      > Thanks for reading this and considering it.
> >      >
> >      > - Tim Goodman
> >      >
> >      > ______________________________________________
> >      > R-devel@r-project.org mailing list
> >      > https://stat.ethz.ch/mailman/listinfo/r-devel
> >      >
> >
>
>
