[R-sig-DB] Re: dbSetDataMappings with DBI.RODBC

Saikat DebRoy @@|k@t @end|ng |rom @t@t@w|@c@edu
Tue Jan 22 18:32:31 CET 2002


>>>>> "David" == David James <dj using research.bell-labs.com> writes:

  David> One problem with this approach is that fetch() may be invoked repeatedly 
  David> to fetch subsets of the result set, so the levels built at each invocation
  David> will only include levels that appear during that fetch.  One could think
  David> of using closures as converters to better manage data across invocations
  David> of fetch(), but it doesn't seem very clean.

  David> Doug Bates, Tim Keitt, Frank Harrell, and others have already brought up 
  David> some of these issues in this list precisely in the context of factors -- 
  David> e.g., you may need to store the levels of a factor outside the table where 
  David> the factor itself resides.  I believe MS SQL server has facilities to 
  David> define new types -- it may be worthwhile looking into that...

I am not familiar with MS SQL server but I have experimented with various
ways of defining a factor type in Postgresql. The simplest (but
storage wise expensive) way to do this is define a new type which
internally has two fields - an integer (representing the level of this
element of the factor) and a VARCHAR array which represents the labels
of the factor. The problem with this is that the labels vector has to
be stored in each row of the factor column and there is no way to
ensure that the labels in each row are identical. A better approach
might be to store the labels externally in another table with two
columns. The first column would have the level numbers from 1 to n - n
being total number of levels. The second column would have the
labels. For example -

create table cars (
 speed integer, -- this is an ordered factor
 dist integer
);

create table cars_speed_levels (
 level integer,
 label varchar
);

You can view the table cars with factor labels by using a join:
select * from cars, cars_speed_levels where cars.speed = cars_speed_levels.level;

When you are fetching a result from within R, you would have to use a
version of fetch which accepts names of the level tables. Here is a
modified version of David's example (with his overloaded fetch method).

res <- dbSendQuery(con, "select * from cars")  ## cars is from data(cars)
d <- fetch(res, post = list(speed = function(x, con)
                                    {
                                        levels <- fetch(dbSendQuery(con, "select label from cars_speed_levels order by level")
                                        ordered(x, levels = levels)
                                    })

The main advantage of this approach is that it follows the principle
of normalizing database tables. It also uses minimal storage. You do
have to create extra tables for each factor in each table in the
database. Also note that you have to supply the connection to the
function so that it can fetch the labels from the database.

Saikat
 -- 
Department of Statistics                       Email: saikat using stat.wisc.edu
University of Wisconsin - Madison              Phone: (608) 263 5948
1210 West Dayton Street                        Fax:   (608) 262 0032
Madison, WI 53706-1685




More information about the R-sig-DB mailing list