[Rd] the pipe |> and line breaks in pipelines
Timothy Goodman
t|m@goodm@n @end|ng |rom gm@||@com
Wed Dec 9 06:45:04 CET 2020
Hi,
I'm a data scientist who routinely uses R in my day-to-day work, for tasks
such as cleaning and transforming data, exploratory data analysis, etc.
This includes frequent use of %>%, the pipe operator from the magrittr
package (re-exported by dplyr). So I was pleased to hear about the recent
work on a native pipe operator, |>.
This seems like a good time to bring up the main pain point I encounter
when using pipes in R, and some suggestions on what could be done about
it. The issue is that the pipe operator can't be placed at the start of a
line of code (except inside parentheses). That's no different from any other
binary operator in R, but I find it's a source of difficulty for the pipe
because of how pipes are often used.
[I'm assuming here that my usage is fairly typical of a lot of users; at
any rate, I don't think I'm *too* unusual.]
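To make the restriction concrete, here is a small sketch using only base R.
It relies on parse(), which checks syntax without evaluating anything, so
%>% doesn't need to be defined. The helper name parses() and the
placeholder code strings are made up for illustration.

```r
# Hypothetical helper: TRUE if the text is syntactically valid R.
parses <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

leading  <- "x\n  %>% f()"     # pipe starts the second line
trailing <- "x %>%\n  f()"     # pipe ends the first line
wrapped  <- "(x\n  %>% f())"   # leading pipe, but inside parentheses

parses(leading)    # FALSE
parses(trailing)   # TRUE
parses(wrapped)    # TRUE
```

The first form fails because "x" is already a complete expression when the
line ends, so the next line starts with an operator that has no left-hand
side.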
=== Why this is a problem ===
It's very common (for me, and I suspect for many users of dplyr) to write
multi-step pipelines and put each step on its own line for readability.
Something like this:
### Example 1 ###
my_data_frame_1 %>%
  filter(some_conditions_1) %>%
  inner_join(my_data_frame_2, by = some_columns_1) %>%
  group_by(some_columns_2) %>%
  summarize(some_aggregate_functions_1) %>%
  filter(some_conditions_2) %>%
  left_join(my_data_frame_3, by = some_columns_3) %>%
  group_by(some_columns_4) %>%
  summarize(some_aggregate_functions_2) %>%
  arrange(some_columns_5)
[I guess some might consider this an overly long pipeline; for me it's
pretty typical. I *could* split it up by assigning intermediate results to
variables, but much of the value I get from the pipe is that it lets my
code communicate which results are temporary, and which will be used again
later. Assigning variables for single-use results would remove that
expressiveness.]
I would prefer (for reasons I'll explain) to be able to write the above
example like this, which isn't valid R:
### Example 2 (not valid R) ###
my_data_frame_1
  %>% filter(some_conditions_1)
  %>% inner_join(my_data_frame_2, by = some_columns_1)
  %>% group_by(some_columns_2)
  %>% summarize(some_aggregate_functions_1)
  %>% filter(some_conditions_2)
  %>% left_join(my_data_frame_3, by = some_columns_3)
  %>% group_by(some_columns_4)
  %>% summarize(some_aggregate_functions_2)
  %>% arrange(some_columns_5)
One (minor) advantage is obvious: It lets you easily line up the pipes,
which means that you can see at a glance that the whole block is a single
pipeline, and you'd immediately notice if you inadvertently omitted a pipe,
which otherwise can lead to confusing output. [It's also aesthetically
pleasing, especially when %>% is replaced with |>, but that's subjective.]
But the bigger issue happens when I want to re-run just *part* of the
pipeline. I do this often when debugging: if the output of the pipeline
seems wrong, I re-run the first few steps and check the output, then
include a little more and re-run again, etc., until I locate my mistake.
Working in an interactive notebook environment, this involves using the
cursor to select just the part of the code I want to re-run.
It's fast and easy to select *entire* lines of code, but unfortunately with
the pipes placed at the end of the line I must instead select everything
*except* the last three characters of the line (the last two characters for
the new pipe). Then when I want to re-run the same partial pipeline with
the next line of code included, I can't just press SHIFT+Down to select it
as I otherwise would, but instead must move the cursor horizontally to a
position three characters before the end of *that* line (which is generally
different due to varying line lengths). And so forth each time I want to
include an additional line.
Moreover, with the staggered positions of the pipes at the end of each
line, it's very easy to accidentally select the final pipe on a line, and
then sit there for a moment wondering if the environment has stopped
responding before realizing it's just waiting for further input (i.e., for
the right-hand side). These small delays and disruptions add up over the
course of a day.
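The "waiting for further input" behavior can be seen from the parser's side
with a rough base-R sketch. A selection that ends in a pipe is an incomplete
expression; parse() reports that as an "unexpected end of input" error,
whereas an interactive session just keeps waiting for more. The helper name
is_incomplete() is made up, and matching on the message text assumes an
English locale.

```r
# Hypothetical helper: TRUE when the text fails to parse only because it
# ends too early. Relies on the English wording of the parser's message.
is_incomplete <- function(code) {
  tryCatch({
    parse(text = code)
    FALSE
  }, error = function(e) {
    grepl("unexpected end of input", conditionMessage(e), fixed = TRUE)
  })
}

is_incomplete("my_data %>%\n  head() %>%")   # selection ends with a pipe
is_incomplete("my_data %>%\n  head()")       # selection ends with a full step
```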
This desire to select and re-run the first part of a pipeline is also the
reason why it doesn't suffice to achieve syntax like my "Example 2" by
wrapping the entire pipeline in parentheses. That's of no use if I want to
re-run a selection that doesn't include the final close-paren.
=== Possible Solutions ===
I can think of two, but maybe there are others. The first would make
"Example 2" into valid code, and the second would allow you to run a
selection that included a trailing pipe.
Solution 1: Add a special case to how R is parsed, so that if the first
(non-whitespace) token after a line break is a pipe, that pipe is treated as
if it came before the line break.
  - Argument for: This lets you write code like "Example 2", which
    addresses the pain point around re-running part of a pipeline, and has
    advantages for readability. Also, since starting a line with a pipe
    operator is currently invalid, the change wouldn't break any working
    code.
  - Argument against: It would make the behavior of %>% inconsistent with
    that of other binary operators in R. (However, this objection might not
    apply to the new pipe, |>, which I understand is being implemented as a
    syntax transformation rather than a binary operator.)
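To illustrate the intended effect of Solution 1 (not how it would actually
be implemented in the parser), here is a text-level sketch that moves a
leading pipe to the end of the previous line before parsing. The function
name move_leading_pipes() is made up, and the regular expression is a
simplification: it doesn't skip string literals, and it would break on a
previous line that ends in a comment.

```r
# Hypothetical helper (not part of R or any package): rewrites source text
# so that a pipe starting a line ends the previous line instead.
move_leading_pipes <- function(code) {
  gsub("[ \t]*\n([ \t]*)(%>%|\\|>)[ \t]*", " \\2\n\\1", code, perl = TRUE)
}

example_2_style <- "c(1, 4, 9)\n  |> sqrt()\n  |> sum()"
rewritten <- move_leading_pipes(example_2_style)
cat(rewritten, "\n")

# Evaluating the rewritten text needs the native pipe (R >= 4.1).
eval(parse(text = rewritten))   # 6
```

A real change would presumably work on tokens rather than text, which is
what would let it ignore pipes inside strings and comments.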
Solution 2: Ignore the pipe operator if it occurs as the final token of
the code being executed.
  - Argument for: This would mean the user could select and re-run the
    first few lines of a longer pipeline (selecting *entire* lines), avoiding
    the difficulties described above.
  - Argument against: This means that %>% would be valid even if it
    occurred without a right-hand side, which is inconsistent with other
    operators in R. (But, as above, this objection might not apply to |>.)
    Also, this solution still doesn't enable the syntax of "Example 2", with
    its readability benefit.
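The effect of Solution 2 can be approximated today, outside the parser, by
stripping a trailing pipe from the selected text before evaluating it. This
is only a sketch of the idea: the function name strip_trailing_pipe() is
made up, and it doesn't handle a selection whose last line ends in a
comment.

```r
# Hypothetical helper (not part of R or any IDE): drops one pipe operator,
# plus surrounding whitespace, from the very end of the selected text.
strip_trailing_pipe <- function(code) {
  sub("\\s*(%>%|\\|>)\\s*$", "", code, perl = TRUE)
}

# Two whole lines selected from a longer pipeline, trailing pipe included.
selection <- "c(1, 4, 9) |>\n  sqrt() |>\n"
runnable <- strip_trailing_pipe(selection)

# Evaluating the result needs the native pipe (R >= 4.1).
eval(parse(text = runnable))
```

Text that doesn't end in a pipe is returned unchanged, so a helper like this
could be applied to every selection.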
Thanks for reading this and considering it.
- Tim Goodman