[R-sig-DB] SQL speed vs R

Hadley Wickham h@w|ckh@m @end|ng |rom gm@||@com
Wed May 7 14:08:21 CEST 2014


> I think of R and SQL as mostly complementary. The example query is simple in
> SQL, and relatively ugly in R. But also, I would expect it to be much faster
> with an SQL engine, after all, that is what SQL engines are designed to do.
> So the implied suggestion that it might be faster in R really surprises me.
> (Implied by the original question and also somewhat endorsed by some
> responses.)

I think your expectation is wrong. SQL engines are not designed to do
fast in-memory analysis. They are designed to be:

* fast at adding new rows (in an ACID manner)
* fast at working with on disk

My expectation (after working on dplyr) is that if you can fit the
data in memory, R should be faster. If it's not faster, it's a sign
that the underlying R code is poorly written.

The right indexes can make a RDMS competitive with naive loops on
in-memory data, but I rarely find subsetting to be a bottleneck.
Indexes help a little with counts, but don't help at all for any other
summaries

> 1/ Has anyone ever had the experience that an R "query" of a dataframe is
> faster than an SQL database for a non-trivial SQL style query on a modestly
> large dataset?  If so, what are the parameters that one should think about
> when considering if a "pure R" solution might be faster?

If it fits in memory, R _should_ be faster.

> 2/ Assuming no in 1/, is there any programming language that would be
> preferred to an SQL engine for doing SQL style queries? (Please leave aside
> the debate over your favourite flavour of SQL engine, and the possibility
> that the database would be better in some non-SQL structure.) If yes, I
> would really appreciate some explanation. I think of the fact that general
> programming languages needs to do other things besides SQL queries as
> something like a constraint, and there is a theorem that a constrained
> optimization can never be better than an unconstrained one.

Not a programming language, but I think it's useful to be aware of
columnar databases (e.g. monetdb). These are column based (liked R),
not row based (like most RDMSs), which means that they're slower at
adding new rows, but faster at doing analytic summaries.

Hadley

-- 
http://had.co.nz/




More information about the R-sig-DB mailing list