[R] Using and abusing %>% (was Re: Why can't I access this type?)

Fri Mar 27 15:27:19 CET 2015

On 2015-03-26 07:48, Patrick Connolly wrote:

> On Wed, 25-Mar-2015 at 03:14PM +0100, Henric Winell wrote:
>
> ...
>
> |> Well...  Opinions may perhaps differ, but apart from '%>%' being
> |> butt-ugly it's also fairly slow:
>
> Beauty, it is said, is in the eye of the beholder.  I'm impressed by
> the way using %>% reduces or eliminates complicated nested brackets.

I didn't dispute whether '%>%' may be useful -- I just pointed out that 
it is slow.  However, it is only part of the problem: 'filter()' and 
'select()', although aesthetically pleasing, also seem to be slow:

 > library(dplyr)           # provides filter(), select() and re-exports %>%
 > library(microbenchmark)
 > all.states <- data.frame(state.x77, Name = rownames(state.x77))
 >
 > f1 <- function()
+     all.states[all.states$Frost > 150, c("Name", "Frost")]
 >
 > f2 <- function()
+     subset(all.states, Frost > 150, select = c("Name", "Frost"))
 >
 > f3 <- function() {
+     filt <- subset(all.states, Frost > 150)
+     subset(filt, select = c("Name", "Frost"))
+ }
 >
 > f4 <- function()
+     all.states %>% subset(Frost > 150) %>%
+         subset(select = c("Name", "Frost"))
 >
 > f5 <- function()
+     select(filter(all.states, Frost > 150), Name, Frost)
 >
 > f6 <- function()
+     all.states %>% filter(Frost > 150) %>% select(Name, Frost)
 >
 > mb <- microbenchmark(
+     f1(), f2(), f3(), f4(), f5(), f6(),
+     times = 1000L
+ )
 > print(mb, signif = 3L)
Unit: microseconds
  expr min   lq      mean median   uq  max neval   cld
  f1() 115  124  134.8812    129  134 1500  1000 a
  f2() 128  141  147.4694    145  151 1520  1000 a
  f3() 303  328  344.3175    338  348 1740  1000  b
  f4() 458  494  518.0830    510  523 1890  1000   c
  f5() 806  848  887.7270    875  894 3510  1000    d
  f6() 971 1010 1056.5659   1040 1060 3110  1000     e

So, using '%>%' but leaving 'filter()' and 'select()' out of the 
equation, as in 'f4()', is only about half as slow as the "full" 'dplyr' 
idiom in 'f6()' (median 510 vs. 1040 microseconds).  In this case, since 
we're talking microseconds, the difference is negligible in absolute 
terms, but that *is* beside the point.
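
A timing comparison like the one above is only meaningful if the variants return the same result.  A minimal check, restricted to the base R variants 'f1()' to 'f3()' so that it runs without extra packages (the 'same' vector is just an illustrative name):

```r
# Rebuild the data frame used in the benchmark session.
all.states <- data.frame(state.x77, Name = rownames(state.x77))

f1 <- function()
    all.states[all.states$Frost > 150, c("Name", "Frost")]

f2 <- function()
    subset(all.states, Frost > 150, select = c("Name", "Frost"))

f3 <- function() {
    filt <- subset(all.states, Frost > 150)
    subset(filt, select = c("Name", "Frost"))
}

# all.equal() returns TRUE when two objects match; isTRUE() guards
# against the character vector of differences it returns otherwise.
same <- c(f2 = isTRUE(all.equal(f1(), f2())),
          f3 = isTRUE(all.equal(f1(), f3())))
print(same)
```

The 'dplyr' variants could be compared the same way, although row names may need to be ignored there.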

> In this tiny example it's not obvious but it's very clear if the
> objective is to sort the dataframe by three or four columns and
> various lots of aggregation then returning a largish number of
> consecutive columns, omitting the rest.  It's very easy to see what's
> going on without the need for intermediate objects.

Why are you opposed to using intermediate objects?  In this case, as can 
be seen from 'f3()', it will also have the benefit of being faster than 
either '%>%' or the "full" 'dplyr' idiom.
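
As a rough sketch of what the intermediate-object style might look like for the kind of task described above (aggregate, sort by several columns, keep a run of consecutive columns): the 'Region' grouping column, the chosen columns and the object names are assumptions made for illustration only.

```r
all.states <- data.frame(state.x77, Name = rownames(state.x77))
# Assumption: use 'state.region' from the 'datasets' package as a grouping
# variable; it is stored in the same state order as 'state.x77'.
all.states$Region <- state.region

# Step 1: aggregate three columns by region.
by.region <- aggregate(all.states[, c("Population", "Income", "Illiteracy")],
                       by = list(Region = all.states$Region), FUN = mean)

# Step 2: sort by descending income, then ascending population.
sorted <- by.region[order(-by.region$Income, by.region$Population), ]

# Step 3: keep the first three (consecutive) columns.
result <- sorted[, 1:3]
print(result)
```

Each step leaves a named object that can be inspected on its own, which is the readability argument for this style.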

> |> [...]
>
> It's no surprise that instructing a computer in something closer to
> human language is an order of magnitude slower.

Certainly not true, at least for compiled languages.  In any case, 
judging from off-list correspondence, it definitely came as a surprise 
to some R users...

Given that '%>%' is so heavily marketed through 'dplyr', where the 
latter is said to provide "blazing fast performance for in-memory data 
by writing key pieces in C++" and "a fast, consistent tool for working 
with data frame like objects, both in memory and out of memory", I don't 
think it's far-fetched to expect that it should be more performant than 
base R.

> I'm sure you'd get something even quicker using machine code.

Don't be ridiculous.  We're mainly discussing

all.states[all.states$Frost > 150, c("Name", "Frost")]

vs.

all.states %>% filter(Frost > 150) %>% select(Name, Frost)

i.e., pure R code.

> I spend 3 or 4 orders of magnitude more time writing code than running it.

You and me both.  But that doesn't mean speed is of no or little importance.

> It's much more important to me to be able to read and modify than
> it is to have it run at optimum speed.

Good for you.  But surely, if this is your goal, nothing beats 
intermediate objects.  And like I said, it may still be faster than the 
'dplyr' idiom.

> |> Of course, this doesn't matter for interactive one-off use.  But
> |> lately I've seen examples of the '%>%' operator creeping into
> |> functions in packages.
>
> That could indicate that %>% is seductively easy to use.  It's
> probably true that there are places where it should be done the hard
> way.

We all know how easy it is to write ugly and sluggish code in R.  But 
'foo[i,j]' is neither ugly nor sluggish and certainly not "the hard way."

> |>  However, it would be nice to see a fast pipe operator as part of
> |> base R.

Heck, it doesn't even have to be fast as long as it's a bit more elegant 
than '%>%'.
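
To illustrate how little a bare-bones pipe needs, here is a minimal sketch in pure R.  The operator name '%|%' is hypothetical and not part of base R or any package mentioned here, and unlike '%>%' it only accepts a function on its right-hand side, not a call such as 'subset(Frost > 150)'.

```r
# Hypothetical minimal pipe: hands the left-hand value to the function
# given on the right-hand side.
`%|%` <- function(lhs, rhs) rhs(lhs)

all.states <- data.frame(state.x77, Name = rownames(state.x77))

# %any% operators are left-associative, so the steps run top to bottom.
frosty <- all.states %|%
    (function(d) d[d$Frost > 150, ]) %|%
    (function(d) d[, c("Name", "Frost")])
print(frosty)
```

The anonymous functions make this clumsier than 'foo[i, j]', so it is a sketch of the mechanism rather than a proposal.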

Henric Winell
