[R-sig-DB] Basic R to SQL considerations
Paul Gilbert
pg||bert902 @end|ng |rom gm@||@com
Tue Feb 25 04:45:13 CET 2014
(This is prompted by an R-help post entitled '[R] Problem connecting to
database via RPostgreSQL/RS-DBI: "could not connect" error' but this
post is not specific to RPostgreSQL)
The general questions in the above R-help post reminded me of problems I
had when starting to use SQL databases in conjunction with R. Below I
try to provide some general comments and guidance with respect to
options for using SQL databases from R. The main purpose of this post is
so than anyone just starting to use R with SQL may find something to
help with some of the initial decisions they must make. I hope that
others will respond by posting additional detail where that seems
warranted. My knowledge in some parts is limited.
In general SQL databases operate on a client/server model. R is a client
program and will use the client interface of the database. The client
interface drivers typically must be installed separately and prior to
installing the R package that uses the client interface. Most client
interfaces also include a standalone client (e.g. sqlite, mysql, psql).
Among other things, this is useful for ensuring your client machine can
connect to the database you intend to use, as it does not rely on R or
any R packages. Before concluding that the R package does not work, you
should ensure that the stand alone package does work.
You will also need a database to talk to, which means (excepting SQLite)
a computer with the server software installed and set up, a database
defined, and appropriate privileges for your client machine and userid.
Often an R user will not need to do this because the server and database
is provided by someone else (e.g. a "corporate" database within an
organization.) When the database is provided, the choice of which R
client package to use will be dictated by that. When the R user must
also set up the database, then the main hurdle is likely to be the
server side rather than the client side. (The user may need to learn a
fair amount about system administration to install the server, and then
a certain amount about database administration to set up the database
with appropriate privileges.)
It is not unusual for the server to be a Unix or Linux machine and the
client to have a different operating system, like Windows or Mac. In
fact, this may be the most common situation. The SQL interfaces handle
the cross-platform issues between server and client.
The notion of a "totally local" database fits in the SQL context in a
way that will be unfamiliar to users of standalone programs (but see
further below about SQLite). Even for a local database, the server must
still be running, in this case on the same machine as the client, and
must be listening for requests from client applications. That means it
is configured to listen on a local socket, the loopback interface, or
the IP address of the local machine, and recognizes the userid/password
or some other security mechanism. This server can still respond to
multiple client applications running simultaneously, for example, two R
sessions, and to multiple users on the machine. So this is not "totally
local" in the sense that many people may think of that term.
A message like "could not connect" or "Can't connect" in a client
application indicates that the client cannot talk to the server. It is
unlikely that you will get this far if the client software is not
installed properly. The problem is that the server is not running, not
connected to a network the client can reach, blocked by a firewall, not
configured to listen for the client on the interface that the client is
trying to use, or the client username/password or other authentication
mechanism is not recognized by the server. The last problem can be
either that the client is not using the proper credentials or the server
is not set up properly to recognize the credentials.
You should get the same "could not connect" message if you run the
standalone client on the same machine. If you don't, then you are using
different credentials with the standalone client than you are from the R
session and that is the problem you need to resolve. You will probably
need to talk with your system administrator or database administrator if
you are trying to connect to a "corporate" database. In this case it is
usually best to report the errors as produced by the standalone client.
Otherwise you are likely to confuse the administrator into thinking the
problem is in R. (And in many cases that will mean they then consider it
is your problem rather than their problem.)
It is sometimes possible to work around a server setup that does not
allow remote connections by setting up an ssh tunnel, so the connection
appears to the server as if it is local. If the connection is over the
Internet this might also add some additional security in the transmission.
***** Relative strengths of different options *****
As mentioned above, if you are trying to connect to a database provided
by someone else then your choice of which client to use is already
decided, and you do not need to consider this section.
SQLite (www.sqlite.org/) is by far the easiest to set up. Installation
is almost automatic. R package Windows and Mac binaries are available
from CRAN. If you are thinking "totally local" then this is very likely
the option to choose. It is "serverless". The database is in a file
which could be read by more than one session simultaneously, but writing
by multiple sessions will cause difficulties. Permissions are controlled
by the file permissions (I think). A good discussion of whether SQLite
is appropriate or not is provided at www.sqlite.org/whentouse.html. To
compile, the source package compiling tools need to be installed before
the R package is installed. The R package is RSQLite.
MySQL (www.mysql.com/) requires a server, which is installed and set up
separately from the client. An R package Mac binary is available from
CRAN but the Windows binaries are no longer supported (see
cran.at.r-project.org/bin/windows/contrib/r-release/ReadMe). The client
interface library binaries need to be installed. To compile, the source
package compiling tools need to be installed before the R package is
installed. The R package is RMySQL.
PostgreSQL (postgresql.org/) requires a server, which is installed and
set up separately from the client. R package Windows and Mac binaries
are available from CRAN. The client interface library binaries need to
be installed. To compile, the source package compiling tools need to be
installed before the R package is installed. The R package is RPostgreSQL.
Oracle (www.oracle.com) is at the "heavy duty" end of the spectrum. It
seems unlikely that one would choose this without having a specialized
database administrator to install and set up the database and set user
credentials. R package Windows and Mac binaries are not available from
CRAN (see cran.at.r-project.org/web/packages/ROracle/index.html). The
client interface library binaries need to be installed and, to compile,
the source package compiling tools need to be installed before the R
package is installed. The R package is ROracle.
The choice of SQLite or Oracle may be relatively straightforward, but
between MySQL and PostgreSQL the choice is more difficult. There has
been considerable convergence and, apparently, many of the historical
differences no longer exist. One comparison is available at
www.wikivs.com/wiki/MySQL_vs_PostgreSQL, and there are undoubtedly
others. In the end, the choice might come down to options for purchasing
support, or a preferred philosophy about the development model. The R
packages are both mature and well supported (by volunteers). Your choice
might be influenced more by the server side considerations than by R
client considerations. Both have been more than adequate for my
purposes. One difference I found, when loading csv data directly on the
server is that MySQL allowed more incorrectly formatted data (e.g.
dates). That is, it made more guesses, sometimes correctly and sometimes
not. The good guesses may be consider a feature, but PostgreSQL may be a
better option if you consider the loading of csv data to be part of the
data cleaning process. This will usually not be a consideration if you
are loading the data from R. (For what it is worth, my recommendation is
to use both and pay attention to writing standard SQL so you can switch
back and forth easily - but I know that people do not always follow my
recommendations.)
Another option is ODBC (see http://en.wikipedia.org/wiki/ODBC). ODBC
provides a middleware API which can be useful for further standardizing
your interface to the database. The main reason for using this is likely
to be that it is the interface supported by a database you need to
access. The server and client will need ODBC layers in place of, or in
addition to, their usual interfaces. ODBC has its own configuration. If
you are setting up the database yourself this is likely to be more
rather than less complicated. You still need to install and set up the
server ( MySQL, PostgreSQL, or Oracle). The R package is RODBC.
Installation and setup are beyond the scope of this post. Perhaps
someone else can attempt that or point to instructions elsewhere. Just
beware that you need to distinguish installation and setup of the client
from installation and setup of the server, and you may need both. There
are probably other lists with better discussions of the server side,
since it really is not specific to R.
Paul
More information about the R-sig-DB
mailing list