[R-sig-DB] SQL speed vs R

Sean O'Riordain seanpor at acm.org
Wed May 7 16:54:08 CEST 2014


To throw a cat amongst the pigeons...

Postgres and some other backends can use R as a built-in language on the
server side, e.g. PL/R: http://www.joeconway.com/plr/
I'm afraid I have not used any of these in a serious project.

Kind regards,
Sean



On 7 May 2014 15:30, jim holtman <jholtman using gmail.com> wrote:

> The place where I have seen a large difference is when you have to "join" a
> large number of tables to get results.  The problem with doing this in R is
> the restriction of how much can fit into memory, and the time it takes to
> download the data across a connection if the data is remote from the
> machine running R.  As an aside, we have a number of stored procedures on
> the SQL machine that call R locally to do some of the processing.  You have
> to weigh the time to get the data against the time needed to aggregate the
> results.  An SQL server is much better at doing joins across ten different
> tables, and it can scale that across multiple processors if they are
> available.  The resulting table is typically much smaller by the time you
> download it to your local R machine.  There is a big difference if these
> tables start to exceed the memory on your local R machine.
>
> I would agree that if I can fit the data into memory, there are packages
> (data.table & dplyr) that make accessing and aggregating the data pretty
> fast, but if I can do some of that on a SQL machine, the amount of data
> that I have to download and locally process may be a lot smaller.
>
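The tradeoff Jim describes can be sketched with dplyr's database backend (a hedged sketch: the table and column names are hypothetical, and an in-memory SQLite database stands in for a remote server). The join and aggregation are translated to SQL and run inside the database; only the small summary crosses the connection:

```r
library(DBI)
library(dplyr)

# Stand-in for a remote SQL server (hypothetical tables and columns).
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "orders",
                  data.frame(cust_id = c(1L, 1L, 2L), amount = c(10, 20, 5)))
DBI::dbWriteTable(con, "customers",
                  data.frame(cust_id = c(1L, 2L), region = c("EU", "US")))

# The join and GROUP BY execute in the database; collect() downloads
# only the (much smaller) aggregated result into R.
summary_df <- tbl(con, "orders") %>%
  inner_join(tbl(con, "customers"), by = "cust_id") %>%
  group_by(region) %>%
  summarise(total = sum(amount, na.rm = TRUE)) %>%
  collect()

DBI::dbDisconnect(con)
```

Printing the query with show_query() instead of collect() shows the generated SQL, which is a handy way to check what work stayed on the server.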
>
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>
>
> On Wed, May 7, 2014 at 10:08 AM, Tim Keitt <tkeitt using utexas.edu> wrote:
>
> > On Wed, May 7, 2014 at 7:08 AM, Hadley Wickham <h.wickham using gmail.com>
> > wrote:
> >
> > > > I think of R and SQL as mostly complementary. The example query is
> > > > simple in SQL, and relatively ugly in R. But also, I would expect it
> > > > to be much faster with an SQL engine; after all, that is what SQL
> > > > engines are designed to do. So the implied suggestion that it might be
> > > > faster in R really surprises me. (Implied by the original question and
> > > > also somewhat endorsed by some responses.)
> > >
> > > I think your expectation is wrong. SQL engines are not designed to do
> > > fast in-memory analysis. They are designed to be:
> > >
> > > * fast at adding new rows (in an ACID manner)
> > > * fast at working with data on disk
> > >
> > > My expectation (after working on dplyr) is that if you can fit the
> > > data in memory, R should be faster. If it's not faster, it's a sign
> > > that the underlying R code is poorly written.
> > >
> > > The right indexes can make an RDBMS competitive with naive loops on
> > > in-memory data, but I rarely find subsetting to be a bottleneck.
> > > Indexes help a little with counts, but don't help at all for any other
> > > summaries.
> > >
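Hadley's point about indexes can be illustrated with an in-memory SQLite database (a sketch; the table is made up): an index turns a WHERE lookup into a tree search, and can serve a count over an indexed range cheaply, but a general aggregate like AVG still scans the whole column.

```r
library(DBI)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "t", data.frame(id = 1:100000, x = runif(100000)))
DBI::dbExecute(con, "CREATE INDEX idx_t_id ON t (id)")

# Subsetting: the index makes this a lookup, not a full scan.
one_row <- DBI::dbGetQuery(con, "SELECT x FROM t WHERE id = 123")

# Counting over an indexed range can be answered from the index alone...
n <- DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM t WHERE id <= 1000")

# ...but a general summary still has to read every row of x.
avg <- DBI::dbGetQuery(con, "SELECT AVG(x) AS mean_x FROM t")

DBI::dbDisconnect(con)
```

Prefixing a query with EXPLAIN QUERY PLAN in SQLite shows whether the index was actually used.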
> > > > 1/ Has anyone ever had the experience that an R "query" of a dataframe
> > > > is faster than an SQL database for a non-trivial SQL-style query on a
> > > > modestly large dataset?  If so, what are the parameters that one
> > > > should think about when considering if a "pure R" solution might be
> > > > faster?
> > >
> > > If it fits in memory, R _should_ be faster.
> > >
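For the in-memory case, an SQL-style GROUP BY in data.table (one of the packages Jim mentions) looks like this; a minimal sketch with made-up data:

```r
library(data.table)

# A million rows in two groups; values chosen so the answer is known.
dt <- data.table(g = rep(c("a", "b"), each = 500000),
                 x = rep(c(1, 2), times = 500000))

# Equivalent of: SELECT g, AVG(x) AS mean_x, COUNT(*) AS n FROM dt GROUP BY g
res <- dt[, .(mean_x = mean(x), n = .N), by = g]
```

Calling setkey(dt, g) first sorts the table by group, which is data.table's analogue of a database index for repeated grouped queries.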
> > > > 2/ Assuming no in 1/, is there any programming language that would be
> > > > preferred to an SQL engine for doing SQL-style queries? (Please leave
> > > > aside the debate over your favourite flavour of SQL engine, and the
> > > > possibility that the database would be better in some non-SQL
> > > > structure.) If yes, I would really appreciate some explanation. I
> > > > think of the fact that general programming languages need to do other
> > > > things besides SQL queries as something like a constraint, and there
> > > > is a theorem that a constrained optimization can never be better than
> > > > an unconstrained one.
> > >
> > > Not a programming language, but I think it's useful to be aware of
> > > columnar databases (e.g. MonetDB). These are column-based (like R),
> > > not row-based (like most RDBMSs), which means that they're slower at
> > > adding new rows but faster at doing analytic summaries.
> > >
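The row/column distinction also hints at why R can be quick at this kind of work: a data.frame is itself column-oriented (a list of column vectors), so summarising one column scans contiguous memory and never touches the others. A small sketch:

```r
# A data.frame stores each column as its own contiguous vector,
# like a column store and unlike a row-oriented RDBMS page.
df <- data.frame(id = 1:1000000,
                 val = rep(1.5, 1000000),
                 label = rep("row", 1000000))

# Summarising 'val' scans only that one vector; 'id' and 'label'
# are never read.
total <- sum(df$val)
```

This is the same reason adding a row to a data.frame is slow: every column vector has to be reallocated, mirroring the columnar database's weakness at inserts.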
> >
> > All very good points. Just remember that the R in RDBMS stands for
> > relational. They facilitate very large and complex joins, as opposed to
> > simple filtering or aggregation. Column stores are not always ideal for
> > complex joins but are very good for large scans.
> >
> > THK
> >
> >
> > >
> > > Hadley
> > >
> > > --
> > > http://had.co.nz/
> > >
> > > _______________________________________________
> > > R-sig-DB mailing list -- R Special Interest Group
> > > R-sig-DB using r-project.org
> > > https://stat.ethz.ch/mailman/listinfo/r-sig-db
> > >
> >
> >
> >
> > --
> > http://www.keittlab.org/
> >
