[Rd] the pipe |> and line breaks in pipelines
Duncan Murdoch
murdoch@dunc@n @end|ng |rom gm@||@com
Wed Dec 9 21:12:48 CET 2020
On 09/12/2020 2:33 p.m., Timothy Goodman wrote:
> If I type my_data_frame_1 and press Enter (or Ctrl+Enter to execute the
> command in the Notebook environment I'm using) I certainly *would*
> expect R to treat it as a complete statement.
>
> But what I'm talking about is a different case, where I highlight a
> multi-line statement in my notebook:
>
> my_data_frame1
> |> filter(some_conditions_1)
>
> and then press Ctrl+Enter.
I don't think I'd like it if parsing changed between passing one line at
a time and passing a block of lines. I'd like to be able to highlight a
few lines and pass those, then type one, then highlight some more and
pass those: and have it act as though I just passed the whole combined
block, or typed everything one line at a time.
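
To make the comparison concrete, here is a minimal parse-only sketch. Nothing is evaluated, so the placeholder names need not exist. It uses %>% so that it parses on any R version; the assumption is that |> behaves the same way in builds that support it. The helper parses() is hypothetical, introduced only for this illustration.

```r
# Hypothetical helper: TRUE if `code` parses, FALSE on a syntax error.
parses <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

block <- "my_data_frame_1\n%>% filter(some_conditions_1)"

# Passed as one block: line 1 is already a complete expression, so the
# leading pipe on line 2 is a syntax error.
block_ok <- parses(block)

# Passed one line at a time: line 1 parses on its own, line 2 does not.
lines <- strsplit(block, "\n", fixed = TRUE)[[1]]
line_ok <- vapply(lines, parses, logical(1), USE.NAMES = FALSE)

print(block_ok)  # FALSE
print(line_ok)   # TRUE FALSE
```

Either way the first line is accepted as complete and the second is rejected, which is the consistency between the two modes that I would like to keep.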
> Or, I suppose the equivalent would be to run
> an R script containing those two lines of code, or to run a multi-line
> statement like that from the console (which in RStudio I can do by
> pressing Shift+Enter between the lines.)
>
> In those cases, R could either (1) Give an error message [the current
> behavior], or (2) understand that the first line is meant to be piped to
> the second. The second option would be significantly more useful, and
> is almost certainly what the user intended.
>
> (For what it's worth, there are some languages, such as Javascript, that
> consider the first token of the next line when determining if the
> previous line was complete. JavaScript's rules around this are overly
> complicated, but a rule like "a pipe following a line break is treated
> as continuing the previous line" would be much simpler. And while it
> might be objectionable to treat the operator %>% different from other
> operators, the addition of |>, which isn't truly an operator at all,
> seems like the right time to consider it.)
I think this would be hard to implement with R's current parser, but
possible. I think it could be done by distinguishing between EOL
markers within a block of text and "end of block" marks. If it applied
only to the |> operator it would be *really* ugly.
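
As a rough illustration of treating line ends inside a block differently from the end of the block, here is a text-level stand-in. It is not a parser change and not a proposal for how R itself would do it: join_leading_pipes() is a hypothetical helper that rewrites a whole block before it is parsed, and it assumes the pipe never appears at the start of a line inside a string or after a comment.

```r
# Hypothetical helper: treat "newline, optional whitespace, pipe" inside a
# block as a continuation by pulling the pipe up onto the previous line.
# Naive by design: it would also rewrite matches inside strings, and it
# breaks if the previous line ends in a comment.
join_leading_pipes <- function(code) {
  gsub("\\s*\\n\\s*(\\|>|%>%)", " \\1", code, perl = TRUE)
}

block <- paste(
  "my_data_frame_1",
  "  %>% filter(some_conditions_1)",
  "  %>% arrange(some_columns_5)",
  sep = "\n"
)
joined <- join_leading_pipes(block)
cat(joined, "\n")

# Parse only; the placeholder names are never evaluated.
n_exprs <- length(parse(text = joined))
print(n_exprs)  # 1
```

Note that this only works when the whole block arrives in one piece. A line typed on its own has already been accepted as complete by the time the next line shows up.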
My strongest objection to it is the one at the top, though. If I have a
block of lines sitting in my editor that I just finished executing, with
the cursor pointing at the next line, I'd like to know that it didn't
matter whether the lines were passed one at a time, as a block, or some
combination of those.
Duncan Murdoch
>
> -Tim
>
> On Wed, Dec 9, 2020 at 3:12 AM Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>
> The requirement for operators at the end of the line comes from the
> interactive nature of R. If you type
>
> my_data_frame_1
>
> how could R know that you are not done, and are planning to type the
> rest of the expression
>
> %>% filter(some_conditions_1)
> ...
>
> before it should consider the expression complete? The way languages
> like C do this is by requiring a statement terminator at the end. You
> can also do it by wrapping the entire thing in parentheses ().
>
> However, be careful: Don't use braces: they don't work. And parens
> have the side effect of removing invisibility from the result (which is
> a design flaw or bonus, depending on your point of view). So I actually
> wouldn't advise this workaround.
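
A quick parse-only check of the parentheses and braces cases described above, plus the visibility side effect. This is a minimal sketch: parses() is a hypothetical helper, and %>% is used so that the code parses without assuming a particular R version.

```r
# Hypothetical helper: TRUE if `code` parses, FALSE on a syntax error.
parses <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

leading_pipe <- "my_data_frame_1\n%>% filter(some_conditions_1)"

# Inside (), a newline does not end the expression, so the parser keeps reading.
in_parens <- parses(paste0("(", leading_pipe, ")"))
# Inside {}, a newline still ends the statement, so the leading pipe is an error.
in_braces <- parses(paste0("{", leading_pipe, "}"))

# Parentheses make an invisible result visible again.
plain_visible <- withVisible(invisible(1))$visible
paren_visible <- withVisible((invisible(1)))$visible

print(c(in_parens, in_braces))          # TRUE FALSE
print(c(plain_visible, paren_visible))  # FALSE TRUE
```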
>
> Duncan Murdoch
>
>
> On 09/12/2020 12:45 a.m., Timothy Goodman wrote:
> > Hi,
> >
> > I'm a data scientist who routinely uses R in my day-to-day work, for tasks
> > such as cleaning and transforming data, exploratory data analysis, etc.
> > This includes frequent use of the pipe operator from the magrittr and dplyr
> > libraries, %>%. So, I was pleased to hear about the recent work on a
> > native pipe operator, |>.
> >
> > This seems like a good time to bring up the main pain point I encounter
> > when using pipes in R, and some suggestions on what could be done about
> > it. The issue is that the pipe operator can't be placed at the start of a
> > line of code (except in parentheses). That's no different than any binary
> > operator in R, but I find it's a source of difficulty for the pipe because
> > of how pipes are often used.
> >
> > [I'm assuming here that my usage is fairly typical of a lot of users; at
> > any rate, I don't think I'm *too* unusual.]
> >
> > === Why this is a problem ===
> >
> > It's very common (for me, and I suspect for many users of dplyr) to write
> > multi-step pipelines and put each step on its own line for readability.
> > Something like this:
> >
> > ### Example 1 ###
> >   my_data_frame_1 %>%
> >     filter(some_conditions_1) %>%
> >     inner_join(my_data_frame_2, by = some_columns_1) %>%
> >     group_by(some_columns_2) %>%
> >     summarize(some_aggregate_functions_1) %>%
> >     filter(some_conditions_2) %>%
> >     left_join(my_data_frame_3, by = some_columns_3) %>%
> >     group_by(some_columns_4) %>%
> >     summarize(some_aggregate_functions_2) %>%
> >     arrange(some_columns_5)
> >
> > [I guess some might consider this an overly long pipeline; for me it's
> > pretty typical. I *could* split it up by assigning intermediate results to
> > variables, but much of the value I get from the pipe is that it lets my
> > code communicate which results are temporary, and which will be used again
> > later. Assigning variables for single-use results would remove that
> > expressiveness.]
> >
> > I would prefer (for reasons I'll explain) to be able to write the above
> > example like this, which isn't valid R:
> >
> > ### Example 2 (not valid R) ###
> >   my_data_frame_1
> >     %>% filter(some_conditions_1)
> >     %>% inner_join(my_data_frame_2, by = some_columns_1)
> >     %>% group_by(some_columns_2)
> >     %>% summarize(some_aggregate_functions_1)
> >     %>% filter(some_conditions_2)
> >     %>% left_join(my_data_frame_3, by = some_columns_3)
> >     %>% group_by(some_columns_4)
> >     %>% summarize(some_aggregate_functions_2)
> >     %>% arrange(some_columns_5)
> >
> > One (minor) advantage is obvious: It lets you easily line up the pipes,
> > which means that you can see at a glance that the whole block is a single
> > pipeline, and you'd immediately notice if you inadvertently omitted a pipe,
> > which otherwise can lead to confusing output. [It's also aesthetically
> > pleasing, especially when %>% is replaced with |>, but that's subjective.]
> >
> > But the bigger issue happens when I want to re-run just *part* of the
> > pipeline. I do this often when debugging: if the output of the pipeline
> > seems wrong, I re-run the first few steps and check the output, then
> > include a little more and re-run again, etc., until I locate my mistake.
> > Working in an interactive notebook environment, this involves using the
> > cursor to select just the part of the code I want to re-run.
> >
> > It's fast and easy to select *entire* lines of code, but unfortunately with
> > the pipes placed at the end of the line I must instead select everything
> > *except* the last three characters of the line (the last two characters for
> > the new pipe). Then when I want to re-run the same partial pipeline with
> > the next line of code included, I can't just press SHIFT+Down to select it
> > as I otherwise would, but instead must move the cursor horizontally to a
> > position three characters before the end of *that* line (which is generally
> > different due to varying line lengths). And so forth each time I want to
> > include an additional line.
> >
> > Moreover, with the staggered positions of the pipes at the end of each
> > line, it's very easy to accidentally select the final pipe on a line, and
> > then sit there for a moment wondering if the environment has stopped
> > responding before realizing it's just waiting for further input (i.e., for
> > the right-hand side). These small delays and disruptions add up over the
> > course of a day.
> >
> > This desire to select and re-run the first part of a pipeline is also the
> > reason why it doesn't suffice to achieve syntax like my "Example 2" by
> > wrapping the entire pipeline in parentheses. That's of no use if I want to
> > re-run a selection that doesn't include the final close-paren.
> >
> > === Possible Solutions ===
> >
> > I can think of two, but maybe there are others. The first would make
> > "Example 2" into valid code, and the second would allow you to run a
> > selection that included a trailing pipe.
> >
> > Solution 1: Add a special case to how R is parsed, so if the first
> > (non-whitespace) token after an end-line is a pipe, that pipe gets moved to
> > before the end-line.
> > - Argument for: This lets you write code like example 2, which
> > addresses the pain point around re-running part of a pipeline, and has
> > advantages for readability. Also, since starting a line with a pipe
> > operator is currently invalid, the change wouldn't break any working code.
> > - Argument against: It would make the behavior of %>% inconsistent with
> > that of other binary operators in R. (However, this objection might not
> > apply to the new pipe, |>, which I understand is being implemented as a
> > syntax transformation rather than a binary operator.)
> >
> > Solution 2: Ignore the pipe operator if it occurs as the final token of
> > the code being executed.
> > - Argument for: This would mean the user could select and re-run the
> > first few lines of a longer pipeline (selecting *entire* lines), avoiding
> > the difficulties described above.
> > - Argument against: This means that %>% would be valid even if it
> > occurred without a right-hand side, which is inconsistent with other
> > operators in R. (But, as above, this objection might not apply to |>.)
> > Also, this solution still doesn't enable the syntax of "Example 2", with
> > its readability benefit.
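
The effect of Solution 2 can be approximated outside the parser, for example by a front end that trims the selection before sending it to R. The following is a minimal sketch under that assumption: drop_trailing_pipe() is a hypothetical helper, not an existing R or RStudio function, and the check is parse-only so the placeholder names are never evaluated.

```r
# Hypothetical helper: remove a pipe left dangling at the end of a selection.
# Naive: a pipe-like token at the very end of a trailing comment or string
# would also be removed.
drop_trailing_pipe <- function(code) {
  sub("\\s*(\\|>|%>%)\\s*$", "", code, perl = TRUE)
}

# The first two whole lines of a longer pipeline, as selected in an editor.
selection <- "my_data_frame_1 %>%\n  filter(some_conditions_1) %>%\n"
trimmed <- drop_trailing_pipe(selection)
cat(trimmed, "\n")

n_exprs <- length(parse(text = trimmed))
print(n_exprs)  # 1
```

Only the final pipe is removed; pipes at the end of earlier lines are left alone, so the selected steps still form a single expression.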
> >
> > Thanks for reading this and considering it.
> >
> > - Tim Goodman