[R-sig-DB] Basic R to SQL considerations

Tue Feb 25 04:45:13 CET 2014

(This is prompted by an R-help  post entitled '[R] Problem connecting to 
database via RPostgreSQL/RS-DBI: "could not connect" error' but this 
post is not specific to RPostgreSQL)

The general questions in the above R-help post reminded me of problems I 
had when starting to use SQL databases in conjunction with R. Below I 
try to provide some general comments and guidance with respect to 
options for using SQL databases from R. The main purpose of this post is 
so than anyone just starting to use R with SQL may find something to 
help with some of the initial decisions they must make. I hope that 
others will respond by posting additional detail where that seems 
warranted. My knowledge in some parts is limited.

In general SQL databases operate on a client/server model. R is a client 
program and will use the client interface of the database. The client 
interface drivers typically must be installed separately and prior to 
installing the R package that uses the client interface. Most client 
interfaces also include a standalone client (e.g. sqlite, mysql, psql). 
Among other things, this is useful for ensuring your client machine can 
connect to the database you intend to use, as it does not rely on R or 
any R packages. Before concluding that the R package does not work, you 
should ensure that the stand alone package does work.

You will also need a database to talk to, which means (excepting SQLite) 
a computer with the server software installed and set up, a database 
defined, and appropriate privileges for your client machine and userid. 
Often an R user will not need to do this because the server and database 
is provided by someone else (e.g. a "corporate" database within an 
organization.) When the database is provided, the choice of which R 
client package to use will be dictated by that. When the R user must 
also set up the database, then the main hurdle is likely to be the 
server side rather than the client side. (The user may need to learn a 
fair amount about system administration to install the server, and then 
a certain amount about database administration to set up the database 
with appropriate privileges.)

It is not unusual for the server to be a Unix or Linux machine and the 
client to have a different operating system, like Windows or Mac. In 
fact, this may be the most common situation. The SQL interfaces handle 
the cross-platform issues between server and client.

The notion of a "totally local" database fits in the SQL context in a 
way that will be unfamiliar to users of standalone programs (but see 
further below about SQLite). Even for a local database, the server must 
still be running, in this case on the same machine as the client, and 
must be listening for requests from client applications. That means it 
is configured to listen on a local socket, the loopback interface, or 
the IP address of the local machine, and recognizes the userid/password 
or some other security mechanism. This server can still respond to 
multiple client applications running simultaneously, for example, two R 
sessions, and to multiple users on the machine. So this is not "totally 
local" in the sense that many people may think of that term.

A message like "could not connect" or "Can't connect" in a client 
application indicates that the client cannot talk to the server. It is 
unlikely that you will get this far if the client software is not 
installed properly. The problem is that the server is not running, not 
connected to a network the client can reach, blocked by a firewall, not 
configured to listen for the client on the interface that the client is 
trying to use, or the client username/password or other authentication 
mechanism is not recognized by the server. The last problem can be 
either that the client is not using the proper credentials or the server 
is not set up properly to recognize the credentials.

You should get the same "could not connect"  message if you run the 
standalone client on the same machine. If you don't, then you are using 
different credentials with the standalone client than you are from the R 
session and that is the problem you need to resolve. You will probably 
need to talk with your system administrator or database administrator if 
you are trying to connect to a "corporate" database. In this case it is 
usually best to report the errors as produced by the standalone client. 
Otherwise you are likely to confuse the administrator into thinking the 
problem is in R. (And in many cases that will mean they then consider it 
is your problem rather than their problem.)

It is sometimes possible to work around a server setup that does not 
allow remote connections by setting up an ssh tunnel, so the connection 
appears to the server as if it is local. If the connection is over the 
Internet this might also add some additional security in the transmission.

  ***** Relative strengths of different options *****

As mentioned above, if you are trying to connect to a database provided 
by someone else then your choice of which client to use is already 
decided, and you do not need to consider this section.

SQLite (www.sqlite.org/) is by far the easiest to set up. Installation 
is almost automatic.  R package Windows and Mac binaries are available 
from CRAN. If you are thinking "totally local" then this is very likely 
the option to choose. It is "serverless". The database is in a file 
which could be read by more than one session simultaneously, but writing 
by multiple sessions will cause difficulties. Permissions are controlled 
by the file permissions (I think). A good discussion of whether SQLite 
is appropriate or not is provided at www.sqlite.org/whentouse.html. To 
compile, the source package compiling tools need to be installed before 
the R package is installed. The R package is RSQLite.

MySQL (www.mysql.com/) requires a server, which is installed and set up 
separately from the client. An R package Mac binary is available from 
CRAN but the Windows binaries are no longer supported (see 
cran.at.r-project.org/bin/windows/contrib/r-release/ReadMe). The client 
interface library binaries need to be installed. To compile, the source 
package compiling tools need to be installed before the R package is 
installed. The R package is RMySQL.

PostgreSQL (postgresql.org/) requires a server, which is installed and 
set up separately from the client.  R package Windows and Mac binaries 
are available from CRAN. The client interface library binaries need to 
be installed. To compile, the source package compiling tools need to be 
installed before the R package is installed. The R package is RPostgreSQL.

Oracle (www.oracle.com) is at the "heavy duty" end of the spectrum. It 
seems unlikely that one would choose this without having a specialized 
database administrator to install and set up the database and set user 
credentials.  R package Windows and Mac binaries are not available from 
CRAN (see cran.at.r-project.org/web/packages/ROracle/index.html). The 
client interface library binaries need to be installed and, to compile, 
the source package compiling tools need to be installed before the R 
package is installed. The R package is ROracle.

The choice of SQLite or Oracle may be relatively straightforward, but 
between MySQL and PostgreSQL the choice is more difficult. There has 
been considerable convergence and, apparently, many of the historical 
differences no longer exist. One comparison is available at
www.wikivs.com/wiki/MySQL_vs_PostgreSQL, and there are undoubtedly 
others. In the end, the choice might come down to options for purchasing 
support, or a preferred philosophy about the development model. The R 
packages are both mature and well supported (by volunteers). Your choice 
might be influenced more by the server side considerations than by R 
client considerations. Both have been more than adequate for my 
purposes. One difference I found, when loading csv data directly on the 
server is that MySQL allowed more incorrectly formatted data (e.g. 
dates). That is, it made more guesses, sometimes correctly and sometimes 
not. The good guesses may be consider a feature, but PostgreSQL may be a 
better option if you consider the loading of csv data to be part of the 
data cleaning process. This will usually not be a consideration if you 
are loading the data from R. (For what it is worth, my recommendation is 
to use both and pay attention to writing standard SQL so you can switch 
back and forth easily - but I know that people do not always follow my 
recommendations.)

Another option is ODBC (see http://en.wikipedia.org/wiki/ODBC). ODBC 
provides a middleware API which can be useful for further standardizing 
your interface to the database. The main reason for using this is likely 
to be that it is the interface supported by a database you need to 
access. The server and client will need ODBC layers in place of, or in 
addition to, their usual interfaces. ODBC has its own configuration. If 
you are setting up the database yourself this is likely to be more 
rather than less complicated. You still need to install and set up the 
server ( MySQL, PostgreSQL, or Oracle). The R package is RODBC.

Installation and setup are beyond the scope of this post. Perhaps 
someone else can attempt that or point to instructions elsewhere. Just 
beware that you need to distinguish installation and setup of the client 
from installation and setup of the server, and you may need both. There 
are probably other lists with better discussions of the server side, 
since it really is not specific to R.

Paul