[Rd] the pipe |> and line breaks in pipelines
Ben Bolker
bbo|ker @end|ng |rom gm@||@com
Wed Dec 9 21:19:19 CET 2020
FWIW there is previous discussion of this in a Twitter thread from May:
https://twitter.com/bolkerb/status/1258542150620332039
At the end I suggested defining something like .__END <- identity as a
pipe-ender.
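
A minimal sketch of that pipe-ender idea, assuming R >= 4.1 for the native pipe. The name .__END is just the placeholder suggested above, and sqrt()/sum() stand in for real pipeline steps:

```r
# Hypothetical pipe-ender: an alias for identity(), so it returns its input unchanged.
.__END <- identity

# Every real step ends with a pipe; the ender closes the pipeline.
result <- c(1, 4, 9) |>
  sqrt() |>
  sum() |>
  .__END()
```

Because every real step ends with a pipe, steps can be commented out or reordered without checking which one is last.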
On 12/9/20 2:58 PM, Kevin Ushey wrote:
> I agree with Duncan that the right solution is to wrap the pipe
> expression with parentheses. Having the parser treat newlines
> differently based on whether the session is interactive, or on what
> type of operator happens to follow a newline, feels like a pretty big
> can of worms.
>
> I think this (or something similar) would accomplish what you want
> while still retaining the nice aesthetics of the pipe expression, with
> a minimal amount of syntax "noise":
>
> result <- (
>   data
>     |> op1()
>     |> op2()
> )
>
> For interactive sessions where you wanted to execute only parts of the
> pipeline at a time, I could see that being accomplished by the editor
> -- it could transform the expression so that it could be handled by R,
> either by hoisting the pipe operator(s) up a line, or by wrapping the
> to-be-executed expression in parentheses for you. If such a style of
> coding became popular enough, I'm sure the developers of such editors
> would be interested and willing to support this ...
>
> Perhaps more importantly, it would be much easier to accomplish than a
> change to the behavior of the R parser, and it would be work that
> wouldn't have to be maintained by the R Core team.
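
A rough sketch of the "hoisting" transformation described above, written in R only for illustration. hoist_pipes is a hypothetical helper, not part of any editor or of R itself. It works on raw text, so it would be fooled by pipes inside strings or comments, and it drops a pipe that starts the very first line:

```r
# Hypothetical editor-side helper: move pipes that start a line to the end
# of the previous line, so the selected text parses as ordinary R.
hoist_pipes <- function(code) {
  lines <- strsplit(code, "\n", fixed = TRUE)[[1]]
  lines <- lines[nzchar(trimws(lines))]  # drop blank lines
  if (length(lines) == 0) return("")

  pipe_re <- "^[[:space:]]*(\\|>|%>%)[[:space:]]*"
  leading <- grepl(pipe_re, lines)

  # The operator each line starts with ("" if it starts with none).
  ops <- ifelse(leading, sub(paste0(pipe_re, ".*$"), "\\1", lines), "")
  # The line with its leading operator (and indentation) removed.
  body <- sub(pipe_re, "", lines)

  # Append each line's leading operator to the line before it.
  next_op <- c(ops[-1], "")
  out <- ifelse(nzchar(next_op), paste(body, next_op), body)
  paste(out, collapse = "\n")
}

hoisted <- hoist_pipes("data\n  |> op1()\n  |> op2()")
cat(hoisted, "\n")
```

An editor could apply a transformation like this to the selected text just before sending it to R.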
>
> Best,
> Kevin
>
> On Wed, Dec 9, 2020 at 11:34 AM Timothy Goodman <timsgoodman using gmail.com> wrote:
>>
>> If I type my_data_frame_1 and press Enter (or Ctrl+Enter to execute the
>> command in the Notebook environment I'm using) I certainly *would* expect R
>> to treat it as a complete statement.
>>
>> But what I'm talking about is a different case, where I highlight a
>> multi-line statement in my notebook:
>>
>> my_data_frame_1
>>   |> filter(some_conditions_1)
>>
>> and then press Ctrl+Enter. Or, I suppose the equivalent would be to run an
>> R script containing those two lines of code, or to run a multi-line
>> statement like that from the console (which in RStudio I can do by pressing
>> Shift+Enter between the lines.)
>>
>> In those cases, R could either (1) Give an error message [the current
>> behavior], or (2) understand that the first line is meant to be piped to
>> the second. The second option would be significantly more useful, and is
>> almost certainly what the user intended.
>>
>> (For what it's worth, there are some languages, such as JavaScript, that
>> consider the first token of the next line when determining if the previous
>> line was complete. JavaScript's rules around this are overly complicated,
>> but a rule like "a pipe following a line break is treated as continuing the
>> previous line" would be much simpler. And while it might be objectionable
>> to treat the operator %>% differently from other operators, the addition of
>> |>, which isn't truly an operator at all, seems like the right time to
>> consider it.)
>>
>> -Tim
>>
>> On Wed, Dec 9, 2020 at 3:12 AM Duncan Murdoch <murdoch.duncan using gmail.com>
>> wrote:
>>
>>> The requirement for operators at the end of the line comes from the
>>> interactive nature of R. If you type
>>>
>>> my_data_frame_1
>>>
>>> how could R know that you are not done, and are planning to type the
>>> rest of the expression
>>>
>>> %>% filter(some_conditions_1)
>>> ...
>>>
>>> before it should consider the expression complete? The way languages
>>> like C do this is by requiring a statement terminator at the end. You
>>> can also do it by wrapping the entire thing in parentheses ().
>>>
>>> However, be careful: Don't use braces: they don't work. And parens
>>> have the side effect of removing invisibility from the result (which is
>>> a design flaw or bonus, depending on your point of view). So I actually
>>> wouldn't advise this workaround.
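
The parser behaviour described above can be checked from R itself. This is a small sketch using only base R; n_exprs is a helper name made up here, and the code is passed to parse() as text so nothing needs to be defined or loaded:

```r
# Count the top-level expressions the parser finds in a piece of source text.
n_exprs <- function(code) length(parse(text = code))

# At top level the newline ends the first expression; "+ 1" is then unary plus.
top_level <- n_exprs("x\n+ 1")
# Inside parentheses the newline cannot end the expression.
in_parens <- n_exprs("(x\n+ 1)")
# Inside braces each line is still a separate expression within the block.
in_braces <- length(parse(text = "{x\n+ 1}")[[1]]) - 1

# A line starting with an operator that has no unary form is a syntax error.
leading_pipe_parses <- tryCatch(
  { parse(text = "x\n%>% f()"); TRUE },
  error = function(e) FALSE
)

# Parentheses also make an otherwise invisible value visible.
vis_plain  <- withVisible(invisible(1))$visible
vis_parens <- withVisible((invisible(1)))$visible
```

Here top_level and in_braces are both 2 while in_parens is 1, which is why parentheses work as a wrapper and braces do not.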
>>>
>>> Duncan Murdoch
>>>
>>>
>>> On 09/12/2020 12:45 a.m., Timothy Goodman wrote:
>>>> Hi,
>>>>
>>>> I'm a data scientist who routinely uses R in my day-to-day work, for tasks
>>>> such as cleaning and transforming data, exploratory data analysis, etc.
>>>> This includes frequent use of the pipe operator from the magrittr and dplyr
>>>> libraries, %>%. So, I was pleased to hear about the recent work on a
>>>> native pipe operator, |>.
>>>>
>>>> This seems like a good time to bring up the main pain point I encounter
>>>> when using pipes in R, and some suggestions on what could be done about
>>>> it. The issue is that the pipe operator can't be placed at the start of a
>>>> line of code (except in parentheses). That's no different than any binary
>>>> operator in R, but I find it's a source of difficulty for the pipe because
>>>> of how pipes are often used.
>>>>
>>>> [I'm assuming here that my usage is fairly typical of a lot of users; at
>>>> any rate, I don't think I'm *too* unusual.]
>>>>
>>>> === Why this is a problem ===
>>>>
>>>> It's very common (for me, and I suspect for many users of dplyr) to write
>>>> multi-step pipelines and put each step on its own line for readability.
>>>> Something like this:
>>>>
>>>> ### Example 1 ###
>>>> my_data_frame_1 %>%
>>>>   filter(some_conditions_1) %>%
>>>>   inner_join(my_data_frame_2, by = some_columns_1) %>%
>>>>   group_by(some_columns_2) %>%
>>>>   summarize(some_aggregate_functions_1) %>%
>>>>   filter(some_conditions_2) %>%
>>>>   left_join(my_data_frame_3, by = some_columns_3) %>%
>>>>   group_by(some_columns_4) %>%
>>>>   summarize(some_aggregate_functions_2) %>%
>>>>   arrange(some_columns_5)
>>>>
>>>> [I guess some might consider this an overly long pipeline; for me it's
>>>> pretty typical. I *could* split it up by assigning intermediate results to
>>>> variables, but much of the value I get from the pipe is that it lets my
>>>> code communicate which results are temporary, and which will be used again
>>>> later. Assigning variables for single-use results would remove that
>>>> expressiveness.]
>>>>
>>>> I would prefer (for reasons I'll explain) to be able to write the above
>>>> example like this, which isn't valid R:
>>>>
>>>> ### Example 2 (not valid R) ###
>>>> my_data_frame_1
>>>>   %>% filter(some_conditions_1)
>>>>   %>% inner_join(my_data_frame_2, by = some_columns_1)
>>>>   %>% group_by(some_columns_2)
>>>>   %>% summarize(some_aggregate_functions_1)
>>>>   %>% filter(some_conditions_2)
>>>>   %>% left_join(my_data_frame_3, by = some_columns_3)
>>>>   %>% group_by(some_columns_4)
>>>>   %>% summarize(some_aggregate_functions_2)
>>>>   %>% arrange(some_columns_5)
>>>>
>>>> One (minor) advantage is obvious: It lets you easily line up the pipes,
>>>> which means that you can see at a glance that the whole block is a single
>>>> pipeline, and you'd immediately notice if you inadvertently omitted a pipe,
>>>> which otherwise can lead to confusing output. [It's also aesthetically
>>>> pleasing, especially when %>% is replaced with |>, but that's subjective.]
>>>>
>>>> But the bigger issue happens when I want to re-run just *part* of the
>>>> pipeline. I do this often when debugging: if the output of the pipeline
>>>> seems wrong, I re-run the first few steps and check the output, then
>>>> include a little more and re-run again, etc., until I locate my mistake.
>>>> Working in an interactive notebook environment, this involves using the
>>>> cursor to select just the part of the code I want to re-run.
>>>>
>>>> It's fast and easy to select *entire* lines of code, but unfortunately with
>>>> the pipes placed at the end of the line I must instead select everything
>>>> *except* the last three characters of the line (the last two characters for
>>>> the new pipe). Then when I want to re-run the same partial pipeline with
>>>> the next line of code included, I can't just press SHIFT+Down to select it
>>>> as I otherwise would, but instead must move the cursor horizontally to a
>>>> position three characters before the end of *that* line (which is generally
>>>> different due to varying line lengths). And so forth each time I want to
>>>> include an additional line.
>>>>
>>>> Moreover, with the staggered positions of the pipes at the end of each
>>>> line, it's very easy to accidentally select the final pipe on a line, and
>>>> then sit there for a moment wondering if the environment has stopped
>>>> responding before realizing it's just waiting for further input (i.e., for
>>>> the right-hand side). These small delays and disruptions add up over the
>>>> course of a day.
>>>>
>>>> This desire to select and re-run the first part of a pipeline is also the
>>>> reason why it doesn't suffice to achieve syntax like my "Example 2" by
>>>> wrapping the entire pipeline in parentheses. That's of no use if I want to
>>>> re-run a selection that doesn't include the final close-paren.
>>>>
>>>> === Possible Solutions ===
>>>>
>>>> I can think of two, but maybe there are others. The first would make
>>>> "Example 2" into valid code, and the second would allow you to run a
>>>> selection that included a trailing pipe.
>>>>
>>>> Solution 1: Add a special case to how R is parsed, so if the first
>>>> (non-whitespace) token after an end-line is a pipe, that pipe gets moved to
>>>> before the end-line.
>>>>   - Argument for: This lets you write code like Example 2, which
>>>>     addresses the pain point around re-running part of a pipeline, and has
>>>>     advantages for readability. Also, since starting a line with a pipe
>>>>     operator is currently invalid, the change wouldn't break any working code.
>>>>   - Argument against: It would make the behavior of %>% inconsistent with
>>>>     that of other binary operators in R. (However, this objection might not
>>>>     apply to the new pipe, |>, which I understand is being implemented as a
>>>>     syntax transformation rather than a binary operator.)
>>>>
>>>> Solution 2: Ignore the pipe operator if it occurs as the final token of
>>>> the code being executed.
>>>>   - Argument for: This would mean the user could select and re-run the
>>>>     first few lines of a longer pipeline (selecting *entire* lines), avoiding
>>>>     the difficulties described above.
>>>>   - Argument against: This means that %>% would be valid even if it
>>>>     occurred without a right-hand side, which is inconsistent with other
>>>>     operators in R. (But, as above, this objection might not apply to |>.)
>>>>     Also, this solution still doesn't enable the syntax of "Example 2", with
>>>>     its readability benefit.
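
The effect of Solution 2 can be approximated outside the parser by cleaning up the selected text before it is evaluated. A minimal sketch, assuming R >= 4.1 for the final eval() step. strip_trailing_pipe is a hypothetical helper, and it only looks at the very end of the selection:

```r
# Hypothetical helper: drop a pipe (and surrounding whitespace) that ends the
# selected code, so whole-line selections of a partial pipeline can be run.
strip_trailing_pipe <- function(code) {
  # \z anchors at the absolute end of the string (PCRE), not at line ends.
  sub("\\s*(\\|>|%>%)\\s*\\z", "", code, perl = TRUE)
}

# Whole lines selected from a longer pipeline, ending in a dangling pipe.
partial <- "c(1, 4, 9) |>\n  sqrt() |>\n"
cleaned <- strip_trailing_pipe(partial)

# The cleaned text is a complete expression and can be evaluated.
value <- eval(parse(text = cleaned))
```

Code that does not end in a pipe passes through unchanged, so the helper is safe to apply to every selection.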
>>>>
>>>> Thanks for reading this and considering it.
>>>>
>>>> - Tim Goodman
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-devel using r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>