[R] Using and abusing %>% (was Re: Why can't I access this type?)
Henric Winell
nilsson.henric at gmail.com
Fri Mar 27 15:27:19 CET 2015
On 2015-03-26 07:48, Patrick Connolly wrote:
> On Wed, 25-Mar-2015 at 03:14PM +0100, Henric Winell wrote:
>
> ...
>
> |> Well... Opinions may perhaps differ, but apart from '%>%' being
> |> butt-ugly it's also fairly slow:
>
> Beauty, it is said, is in the eye of the beholder. I'm impressed by
> the way using %>% reduces or eliminates complicated nested brackets.
I didn't dispute whether '%>%' may be useful -- I just pointed out that
it is slow. However, it is only part of the problem: 'filter()' and
'select()', although aesthetically pleasing, also seem to be slow:
> library(dplyr)
> library(microbenchmark)
> all.states <- data.frame(state.x77, Name = rownames(state.x77))
>
> f1 <- function()
+ all.states[all.states$Frost > 150, c("Name", "Frost")]
>
> f2 <- function()
+ subset(all.states, Frost > 150, select = c("Name", "Frost"))
>
> f3 <- function() {
+ filt <- subset(all.states, Frost > 150)
+ subset(filt, select = c("Name", "Frost"))
+ }
>
> f4 <- function()
+ all.states %>% subset(Frost > 150) %>%
+ subset(select = c("Name", "Frost"))
>
> f5 <- function()
+ select(filter(all.states, Frost > 150), Name, Frost)
>
> f6 <- function()
+ all.states %>% filter(Frost > 150) %>% select(Name, Frost)
>
> mb <- microbenchmark(
+ f1(), f2(), f3(), f4(), f5(), f6(),
+ times = 1000L
+ )
> print(mb, signif = 3L)
Unit: microseconds
expr min   lq      mean median   uq  max neval cld
f1() 115  124  134.8812    129  134 1500  1000 a
f2() 128  141  147.4694    145  151 1520  1000 a
f3() 303  328  344.3175    338  348 1740  1000 b
f4() 458  494  518.0830    510  523 1890  1000 c
f5() 806  848  887.7270    875  894 3510  1000 d
f6() 971 1010 1056.5659   1040 1060 3110  1000 e
So, using '%>%' but leaving 'filter()' and 'select()' out of the
equation, as in 'f4()', is only half as bad as the "full" 'dplyr' idiom
in 'f6()'. In this case, since we're talking microseconds, the speed-up
is negligible, but that *is* beside the point.
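
To put the medians side by side, here is a small sketch that computes each
variant's slowdown relative to 'f1()'. The numbers are copied from the
benchmark output above; the names 'med', 'ratio_to_f1' and 'f4_vs_f6' are
made up for illustration only.

```r
# Median timings in microseconds, copied from the benchmark output.
med <- c(f1 = 129, f2 = 145, f3 = 338, f4 = 510, f5 = 875, f6 = 1040)

# Slowdown of each variant relative to plain '[' indexing in f1().
ratio_to_f1 <- round(med / med[["f1"]], 1)
print(ratio_to_f1)

# '%>%' with subset() (f4) versus the "full" dplyr idiom (f6).
f4_vs_f6 <- round(med[["f4"]] / med[["f6"]], 2)
print(f4_vs_f6)
```

On these medians 'f4()' takes roughly half the time of 'f6()', and 'f6()'
takes roughly eight times as long as 'f1()'.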
> In this tiny example it's not obvious but it's very clear if the
> objective is to sort the dataframe by three or four columns and
> various lots of aggregation then returning a largish number of
> consecutive columns, omitting the rest. It's very easy to see what's
> going on without the need for intermediate objects.
Why are you opposed to using intermediate objects? In this case, as can
be seen from 'f3()', it will also have the benefit of being faster than
either '%>%' or the "full" 'dplyr' idiom.
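
As a rough sketch of what I mean by intermediate objects, here is a
filter, sort and select sequence in base R only. The object names
('frosty', 'sorted', 'result') and the choice of sort keys are made up for
illustration; they are not taken from anyone's actual code.

```r
# Same data frame as in the benchmark.
all.states <- data.frame(state.x77, Name = rownames(state.x77))

# Step 1: keep the states with more than 150 frost days.
frosty <- all.states[all.states$Frost > 150, ]

# Step 2: sort by two columns (most frost first, ties broken by name).
sorted <- frosty[order(-frosty$Frost, frosty$Name), ]

# Step 3: keep only the columns of interest.
result <- sorted[, c("Name", "Frost")]
print(result)
```

Each step can be inspected on its own, e.g. with 'str(frosty)', which is
handy when something further down the chain goes wrong.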
> |> [...]
>
> It's no surprise that instructing a computer in something closer to
> human language is an order of magnitude slower.
Certainly not true, at least for compiled languages. In any case,
judging from off-list correspondence, it definitely came as a surprise
to some R users...
Given that '%>%' is so heavily marketed through 'dplyr', where the
latter is said to provide "blazing fast performance for in-memory data
by writing key pieces in C++" and "a fast, consistent tool for working
with data frame like objects, both in memory and out of memory", I don't
think it's far-fetched to expect that it should be more performant than
base R.
> I'm sure you'd get something even quicker using machine code.
Don't be ridiculous. We're mainly discussing
all.states[all.states$Frost > 150, c("Name", "Frost")]
vs.
all.states %>% filter(Frost > 150) %>% select(Name, Frost)
i.e., pure R code.
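
For what it's worth, the two base R spellings can be checked against each
other without any add-on packages. This is only a sketch; it uses the
'Name' column from the benchmark data frame, and 'by_index' and
'by_subset' are names invented here.

```r
all.states <- data.frame(state.x77, Name = rownames(state.x77))

# Plain '[' indexing, as in f1().
by_index <- all.states[all.states$Frost > 150, c("Name", "Frost")]

# The subset() spelling, as in f2().
by_subset <- subset(all.states, Frost > 150, select = c("Name", "Frost"))

# Compare the two results.
same <- isTRUE(all.equal(by_index, by_subset))
print(same)
```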
> I spend 3 or 4 orders of magnitude more time writing code than running it.
You and me both. But that doesn't mean speed is of no or little importance.
> It's much more important to me to be able to read and modify than
> it is to have it run at optimum speed.
Good for you. But surely, if this is your goal, nothing beats
intermediate objects. And like I said, it may still be faster than the
'dplyr' idiom.
> |> Of course, this doesn't matter for interactive one-off use. But
> |> lately I've seen examples of the '%>%' operator creeping into
> |> functions in packages.
>
> That could indicate that %>% is seductively easy to use. It's
> probably true that there are places where it should be done the hard
> way.
We all know how easy it is to write ugly and sluggish code in R. But
'foo[i,j]' is neither ugly nor sluggish and certainly not "the hard way."
> |> However, it would be nice to see a fast pipe operator as part of
> |> base R.
Heck, it doesn't even have to be fast as long as it's a bit more elegant
than '%>%'.
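
Just to illustrate that a pipe need not be complicated, here is a minimal
sketch of a function-application operator in plain R. The name '%then%' is
hypothetical, and this is not how '%>%' is implemented: it simply calls
the function on its right with the value on its left, with none of the
placeholder or argument-insertion conveniences.

```r
# Hypothetical operator: apply the function on the right to the value
# on the left.
`%then%` <- function(lhs, rhs) rhs(lhs)

all.states <- data.frame(state.x77, Name = rownames(state.x77))

# Filter rows, then pick columns, reading left to right.
piped <- all.states %then%
  (function(d) d[d$Frost > 150, ]) %then%
  (function(d) d[, c("Name", "Frost")])

# The direct '[' call for comparison.
direct <- all.states[all.states$Frost > 150, c("Name", "Frost")]
print(isTRUE(all.equal(piped, direct)))
```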
Henric Winell