[R-sig-DB] SQL speed vs R

Paul Gilbert pg||bert902 @end|ng |rom gm@||@com
Wed May 7 02:42:13 CEST 2014


(snipped from R-help thread "SQL vs R"

I'm not sure if anyone else has been following the R-help chatter on 
this but, in addition to much unnecessary acrimony,  I am surprised by 
an implied question:

>How do I do something like this without using sqldf?
...
>e <- sqldf("SELECT f, COUNT(*) FROM b GROUP BY f ORDER BY f")
...
>My requirements are that simple. One table, 11 fields, of which 3 are
>interesting, 30 Million records, growing daily by between 300000.
...
> It's in MySQL, at the moment roughly 1.8 GB, if I pull it into a
> dataframe it saves to 180MB. I work from the dataframe.
>
> But, it's not only a size issue it's also a speed issue and hence I
> don't care what I am going to use, as long as it is fast.

(I am assuming it is the original extract that is considered to be slow. 
If the statement really means that working with a very large dataset is 
much slower than working with a small subset, then no one should be 
surprised. The SQL equivalent to the extracted dataframe would be to 
make a temporary table and work with that.)

I think of R and SQL as mostly complementary. The example query is 
simple in SQL, and relatively ugly in R. But also, I would expect it to 
be much faster with an SQL engine, after all, that is what SQL engines 
are designed to do. So the implied suggestion that it might be faster in 
R really surprises me. (Implied by the original question and also 
somewhat endorsed by some responses.)

The things that would seem to me to be obvious candidates for examination:
  - Is the SQL server running out of memory and starting to swap?
  - Are the important fields indexed?

So my questions for R-sig-db followers are:

1/ Has anyone ever had the experience that an R "query" of a dataframe 
is faster than an SQL database for a non-trivial SQL style query on a 
modestly large dataset?  If so, what are the parameters that one should 
think about when considering if a "pure R" solution might be faster?

2/ Assuming no in 1/, is there any programming language that would be 
preferred to an SQL engine for doing SQL style queries? (Please leave 
aside the debate over your favourite flavour of SQL engine, and the 
possibility that the database would be better in some non-SQL 
structure.) If yes, I would really appreciate some explanation. I think 
of the fact that general programming languages needs to do other things 
besides SQL queries as something like a constraint, and there is a 
theorem that a constrained optimization can never be better than an 
unconstrained one.

3/ In addition to my two candidates for examination, above, what are the 
most obvious things one should consider when finding that queries are 
slow? (I do realize that people spend a lot of time learning how to 
optimize databases, I'm really just wondering what the "big" things are.)

Thanks,
Paul




More information about the R-sig-DB mailing list