[Rd] generic database access methods
Timothy H. Keitt
Timothy.Keitt@StonyBrook.Edu
Mon, 02 Apr 2001 17:39:02 -0400
David James wrote:
>
> I think that implicitly there's a class "dbConnection"
> that the various implementations RPgSQL, RMySQL, RODBC, ROracle, etc.,
> would inherit from (extend). Right? Having the various implementations
> inherit from a common base class, say, "dbConnection", could be
> useful for grouping common "database connection" functionality --
> for instance, imagine we want to implement some functionality
> on top of the common Rdbi, and thus for *any* DBMS. I could easily
> imagine moving your idea of proxy data.frames from the RPgSQL
> package to, say, proxyTable, so that it can be used with ROracle, RODBC,
Yes. Exactly. My initial proposal was intentionally simple; I've been
experimenting with having connection objects inherit from a common base
class. (I make some commments about proxies below.)
> etc. Another example would be the ability to attach(conn) and then
> have R manage the reading/writing of tables, e.g.,
>
> Example:
> (See John M. Chambers "Data Management in S" in
> http://stat.bell-labs.com/stat/doc/doc/93.15.ps.gz and
> and "Database Classes" http://stat.bell-labs.com/stat/Sbook/DatabaseClasses)
>
> > con <- dbConnect("RPgSQL", ...)
> > attach(con)
> > ls(pos=2)
> [1] table1 table2 ...
> > median(table1$fieldX)
> > exists("table2", where=2)
> > remove("table4", where=2)
> > assign("newTable4", my.results, where = 2)
This is also one of the things I've been thinking about. However, there are some
problems matching database tables to data frames. Data frames assume deterministic
random access to rows, but databases generally do not. The backend is free to
change the order of the rows at any time after receiving a query and before returning
the result. So to effectively proxy db tables, you have to enforce row ordering
explicitely by adding a "row_number" column and update it on every insertion and
deletion (unless of course,the data frame model were changed to only allow sequential,
cursor based access---but I'll save that for a different proposal!). As a result,
it may make sense to limit "attaching" of databases only to databases where R has
control over the schema, i.e., the database is created in R and only accessed through
R (and so can force a row number column on each table). If you allow attachment of
arbitrary databases, it will be easy to corrupt the row ordering.
> We could implement this for any DBMS *provided* we define
> the various R methods read/write/exists methods in terms of the
> base class "dbConnection" and not in terms of the individual
> RMySQL, RODBC, RPgSQL (otherwise each new R<dbms> driver would be required
> to implement the attached methods, the proxyTable methods, ....)
>
> Following the approach above, it would be very simple to attach
> a DBMS and treat *all* the tables as db.proxy independently of the
> actual DBMS (provided the methods are, again, defined in terms of
> the common base "dbConnection").
>
Absolutely; we just have to define a minimal set of primitives needed to support
proxying (but see comments above).
>
> I would suggest that, if the the query generates a result, dbSendQuery
> retuns a "result set" or "cursor" object. Imagine the result is
> huge -- that could easily kill the R process. This object ("cursor" or
> "result set") could have some methods to fetch(cur, n = num.rows),
> plus some self-description to show its state -- e.g. to show the
> query being procesed, plus methods to tell us how many rows we've
> fetched so far, etc. Even if dbSendQuery() does not generate output,
> we may still want to have some way of querying what happen with the
> operation; for instance, how many rows were affected by the operation
>
> > rs <- dbSendQuery(conn, "delete from TABLE1 where id = 1234")
> > if(rowsAffected(rs) != 1){
> ## oops, something isn't quite right!
> + rollback(conn)
> + }
>
> In many situation, users may not want to bother with this level
> of detail -- all they may want is to get the data. For this I'd follow
> the ODBC(?) "execute immediate" idea and implement a very similar
> method for those cases when you know want to do the "dbSendQuery"
> and "fetch" in one scoop, say "dbExec" or something like it.
>
I was avoiding "result" objects based on the argument that its better to
use normal R operations to access data whenever possible (so users don't
have to learn yet another data interface). However, I agree with your
argument here and adding result objects is easily done. Having two
functions as in RODBC is a good solution. Alternatively, we could
have "as.data.frame.Rdbi.conn <- function(connObj)..."
> Miscellaneous
> -------------
>
> The following are some issues that have been brought up
> either through private conversations and/or during the DSC 2001
> workshop.
>
> 1. Data Type Conversions:
>
> Another set of generic functions that we may need to consider
> could include functions to specify data conversions: for instance,
> a generic getDataType(dbObj, s.atomic.object) that would return the
> appropriate data type on the remote DBMS associated with "conn".
> For instance, if "ora" and "pg" are Oracle and PostgreSQL connections,
> respectively, the expressions
>
> > x.int <- as.integer(x)
> > ora.int <- getDataType(ora, x.int)
> > pg.int <- getDataType(pg, x.int)
>
> would return possibly different DBMS-specific SQL data type for the same
> integer (and possibly different for the same DBMS, in the case
> of R and Splus, as Brian Ripley has pointed out, I believe in the
> R Data Import/Export manual(?)).
>
> We would also need the reverse DBMS type -> R/S object.
>
> Finally, there'll be times that we cannot automatically convert
> non-primitive objects (images? sound?) to/from DBMS and R/S -- we
> should have a mechanism to allow user to specify their own conversion
> functions. We need to think of a "user-friendly" mechanism to do this.
I made a stab at this in RPgSQL. See "types.R" Also, do we want to do
column and table name conversion? I do this in RPgSQL, but I'm finding
that it can lead to confusing situations.
>
> 2. DBMS vs R/S identifiers:
>
> Do we need/want to handle for the user the mapping between
> remote identifiers (e.g., "TABLE_ABC") and R/S? This could
> be easily implemented using the ideas in data mapping ideas above.
>
> 3. Transactions:
>
> Do we want to define generic functions to manage transactions?
Probably not. This should be transparent to users. Do you have
examples of where it is needed? (Again, I'm trying to avoid transfering
the entire SQL programming interface into R and only provide the bits
needed to move data back and forth.)
>
> 4. Remote Objects:
>
> If we decide that having "cursor/result set" objects is desirable,
> then we should note that they are similar to "dbConnection" objects
> in the sense that they are "remote object," i.e., R objects that reference
> remote objects, (remote db object, to be more precise). All these remote
> objects may share some common functionality: all remote objects need
> to be validated before they are passed to the DBMS (e.g., dbConnectionOk),
> we need to ensure that we free the (remote) resources even in the
> case of errors/exceptions, perhaps through judicious use of on.exit()
> (see also Luke Tierney's "Simple References with Finalization"
> http://developer.r-project.org/simpleref.html, but keep in mind that
> this may/may not be portable to S-plus.)
I'll take a look. It sounds useful. (Oops, I couldn't resolve the URL.)
> 5. Naming convention:
>
> We could use (as you are suggesting) a naming convention consisting
> of concatenated words with capitalization for non-starting
> words, e.g., dbDisconnect). The traditional R/S way is to separate
> words with ".", e.g., "read.table". I don't have too strong a preference,
> but perhaps we should be explicit about it?
>
I'm agnostic here. I've been using my experimental Rdbi package for the last
week or so and find that I like the "dbConnect" naming, but I would go with
whatever people wanted.
I have a package ready to distribute for testing. Its actually a package cluster
with "Rdbi.base" and "Rdbi.PgSQL." I've moved the main read loop out of R and
into C and now get around 200K integer reads a second (includes copying results
into a data frame). I'm happy for someone to take over from here (or to propose
an alternative framework). Perhaps we can make it a group effort (transfer copyright
to the R project, etc.) I would suggest that we start with the absolute minimal
interface and add features very carefully. It would be great to see a general
database framework make it into a future release of R.
Tim
--
Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Phone: 631-632-1101, FAX: 631-632-7626
http://life.bio.sunysb.edu/ee/keitt/
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._