[R] Using and abusing %>% (was Re: Why can't I access this type?)
Patrick Connolly
p_connolly at slingshot.co.nz
Sat Mar 28 08:48:52 CET 2015
On Fri, 27-Mar-2015 at 03:27PM +0100, Henric Winell wrote:
|> On 2015-03-26 07:48, Patrick Connolly wrote:
|>
|> >On Wed, 25-Mar-2015 at 03:14PM +0100, Henric Winell wrote:
|> >
|> >...
|> >
|> >|> Well... Opinions may perhaps differ, but apart from '%>%' being
|> >|> butt-ugly it's also fairly slow:
|> >
|> >Beauty, it is said, is in the eye of the beholder. I'm impressed by
|> >the way using %>% reduces or eliminates complicated nested brackets.
|>
|> I didn't dispute whether '%>%' may be useful -- I just pointed out
Likewise, I didn't dispute that it may be slower than other approaches;
I was disputing the claim that it is ugly.
|> that it is slow. However, it is only part of the problem:
|> 'filter()' and 'select()', although aesthetically pleasing, also
|> seem to be slow:
So not 'butt ugly' like '%>%'?
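To illustrate the bracket-reduction point: the two forms below compute
the same thing, one read inside-out and one read left to right. (A
minimal sketch using 'magrittr'; the particular chain of functions is
only an illustration, not anything from the thread.)

```r
library(magrittr)  # provides %>%

x <- c(16.3, 4.1, 9.8, 25.2)

## Nested form: read from the innermost call outwards
nested <- round(sqrt(sort(x)), 1)

## Piped form: read left to right, one step per link
piped <- x %>% sort %>% sqrt %>% round(1)

identical(nested, piped)  # TRUE: the two forms are equivalent
```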
|>
....
|> > mb <- microbenchmark(
|> + f1(), f2(), f3(), f4(), f5(), f6(),
|> + times = 1000L
|> + )
|> > print(mb, signif = 3L)
|> Unit: microseconds
|>  expr  min   lq      mean median   uq  max neval cld
|>  f1()  115  124  134.8812    129  134 1500  1000 a
|>  f2()  128  141  147.4694    145  151 1520  1000 a
|>  f3()  303  328  344.3175    338  348 1740  1000  b
|>  f4()  458  494  518.0830    510  523 1890  1000   c
|>  f5()  806  848  887.7270    875  894 3510  1000    d
|>  f6()  971 1010 1056.5659   1040 1060 3110  1000     e
|>
|> So, using '%>%', but leaving 'filter()' and 'select()' out of the
|> equation, as in 'f4()' is only half as bad as the "full" 'dplyr'
|> idiom in 'f6()'. In this case, since we're talking microseconds,
|> the speed-up is negligible but that *is* beside the point.
Agreed that the more 'dplyr' is used, the slower things get, but I
don't agree that it's an issue except in packages that should be
optimized. The lack of speed won't stop me using it any more than I'll
stop using dataframes just because matrices are much faster. The OP's
example can be done using matrix syntax:
state.x77[state.x77[, "Frost"] > 150, "Frost", drop = FALSE]
which is more than an order of magnitude faster than subscripting a
dataframe. See No. 4 below:
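Note that 'all.states' isn't defined anywhere in this thread; a
plausible construction (my assumption, so the benchmark below can be
reproduced) is the built-in state.x77 matrix as a data frame, with the
state names promoted from rownames to a proper column:

```r
## Hypothetical reconstruction (not shown in the thread): 'all.states'
## is presumably state.x77 as a data frame with a 'state' name column.
all.states <- data.frame(state = rownames(state.x77), state.x77)
```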
microbenchmark(## 1. using subset()
subset(all.states, all.states$Frost > 150, select = c("state","Frost")),
## 2. standard R indexing
all.states[all.states$Frost > 150, c("state","Frost")],
## 3. leave out redundant 'state' column
all.states[all.states$Frost > 150, "Frost", drop = FALSE],
## 4. avoid using 'slow' dataframes altogether
state.x77[state.x77[, "Frost"] > 150, "Frost", drop = FALSE],
## 5. easy, slow way without square brackets or quote marks
all.states %>% filter(Frost > 150) %>% select(state, Frost),
times = 1000L
)
Unit: microseconds
expr
subset(all.states, all.states$Frost > 150, select = c("state", "Frost"))
all.states[all.states$Frost > 150, c("state", "Frost")]
all.states[all.states$Frost > 150, "Frost", drop = FALSE]
state.x77[state.x77[, "Frost"] > 150, "Frost", drop = FALSE]
all.states %>% filter(Frost > 150) %>% select(state, Frost)
      min        lq       mean    median        uq      max neval cld
  223.960  229.9290  236.16557  232.4060  241.4165  291.083  1000   c
  177.187  182.6075  203.04666  185.1475  194.4815 7259.760  1000   c
  125.281  130.4835  135.83826  132.6985  141.7375  210.576  1000  b
    6.442   10.3860   10.61733   11.0405   11.4855   25.077  1000 a
 1416.592 1437.7015 1562.91898 1447.5695 1473.4440 9394.071  1000    d
[...]
|>
|> >In this tiny example it's not obvious but it's very clear if the
|> >objective is to sort the dataframe by three or four columns and
|> >various lots of aggregation then returning a largish number of
|> >consecutive columns, omitting the rest. It's very easy to see what's
|> >going on without the need for intermediate objects.
|>
|> Why are you opposed to using intermediate objects? In this case,
I'm not opposed to intermediate objects nor to dogs. It's just easier
to keep things tidy without either.
|> as can be seen from 'f3()', it will also have the benefit of being
|> faster than either '%>%' or the "full" 'dplyr' idiom.
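'f3()' itself isn't shown in this thread, but the intermediate-object
style being discussed might look something like this (a sketch; the
object names here are my own):

```r
## Intermediate-object style: one named object per step
frosty <- state.x77[, "Frost"] > 150                 # logical row index
result <- state.x77[frosty, "Frost", drop = FALSE]   # keep matrix shape
result
```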
|>
|> >|> [...]
|> >
|> >It's no surprise that instructing a computer in something closer to
|> >human language is an order of magnitude slower.
|>
|> Certainly not true, at least for compiled languages. In any case,
|> judging from off-list correspondence, it definitely came as a
|> surprise to some R users...
|>
|> Given that '%>%' is so heavily marketed through 'dplyr', where the
|> latter is said to provide "blazing fast performance for in-memory
|> data by writing key pieces in C++" and "a fast, consistent tool for
|> working with data frame like objects, both in memory and out of
|> memory", I don't think it's far-fetched to expect that it should be
|> more performant than base R.
|>
I've never come across 'marketing' of free software. Evidently that's
a looser use of the word.
...
|> >I spend 3 or 4 orders of magnitude more time writing code than running it.
|>
|> You and me both. But that doesn't mean speed is of no or little importance.
I never claimed it was. Tardiness hasn't yet become an issue for me.
When it does, I'll revert to the old ways.
|>
|> >It's much more important to me to be able to read and modify than
|> > it is to have it run at optimum speed.
|>
|> Good for you. But surely, if this is your goal, nothing beats
|> intermediate objects.
Nothing except chaining, that is. I went 16 years without it and now
find it amazing how useful it is. As they say: You're never too old
to learn.
|> And like I said, it may still be faster than the 'dplyr' idiom.
|>
|> >|> Of course, this doesn't matter for interactive one-off use. But
|> >|> lately I've seen examples of the '%>%' operator creeping into
|> >|> functions in packages.
|> >
|> >That could indicate that %>% is seductively easy to use. It's
|> >probably true that there are places where it should be done the hard
|> >way.
|>
|> We all know how easy it is to write ugly and sluggish code in R.
|> But 'foo[i,j]' is neither ugly nor sluggish and certainly not "the
|> hard way."
I meant to put a ':-)' in there. Such adjectives as 'easy' and 'hard' are
relative. There's little difference in difficulty at each step, but
integrating them and revising later are considerably easier using the
so-called "'dplyr' idiom" -- especially if each link in the chain is
on a separate line.
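For example, with each link of the chain on its own line (a sketch
using 'dplyr'; 'all.states' is built here from the built-in state.x77,
since it isn't defined in the thread):

```r
library(dplyr)

## Assumed construction of 'all.states' (not shown in the thread)
all.states <- data.frame(state = rownames(state.x77), state.x77)

all.states %>%
    filter(Frost > 150) %>%   # keep only the frosty states ...
    select(state, Frost)      # ... and just the two columns of interest
```

Each step can be read, commented out, or revised independently, which
is the ease of modification being claimed here.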
|>
|> >|> However, it would be nice to see a fast pipe operator as part of
|> >|> base R.
|>
|> Heck, it doesn't even have to be fast as long as it's a bit more
|> elegant than '%>%'.
IMHO, %>% fits in nicely with %/%, %%, and %in%. Elegance, like
beauty, is in the eye of the beholder.
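Any user-defined infix operator in R follows the same '%...%' naming
convention, which is why '%>%' looks at home next to '%%', '%/%' and
'%in%'. A toy example of my own:

```r
## A toy infix operator in the same %...% family as %% and %in%
`%+%` <- function(a, b) paste(a, b)

"never" %+% "too" %+% "old"  # "never too old"
```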
|>
|>
|> Henric Winell
--
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
___ Patrick Connolly
{~._.~} Great minds discuss ideas
_( Y )_ Average minds discuss events
(:_~*~_:) Small minds discuss people
(_)-(_) ..... Eleanor Roosevelt
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.