[R] Ordered factors and DBMS

fharrell@virginia.edu fharrell at virginia.edu
Fri Oct 5 04:47:34 CEST 2001


David James wrote:
> 
> Frank E Harrell Jr wrote:
> > As someone mentioned in a previous posting, the inherent ENUM and
> > SET data types in MySQL give it an advantage in this area.
> 
> Unfortunately this is highly non-portable.  Moreover, the conversion
> is not 1-1, for both factors and ordered factors get mapped into
> MySQL's ENUM creating an ambiguity when we fetch a previously
> exported R/S object.  XML, as you point out, is more amenable for
> this type of higher-level specification.

Thanks very much for your note David.  As I plan to stick with
MySQL and to drive the database creation from the metadata in
XML the non-portability is not an issue for me.  The metadata
can also direct a "smart" export from MySQL into other
DBMS if the need ever arises.  The metadata can also provide
the original ordering of categories (which I will use mainly
as the display order on pull-down lists), and this can be
communicated to R to adjust a data frame after import.

> 
> > I am just starting to work on a project in which XML
> > Metadata specifies a MySQL database (including factor
> > levels for ENUM and SET (multiple choice instead of single)
> > fields).  This metadata will later drive some "smart"
> > import functions in R that will make such variables
> > immediately analyzable and self-documenting.  Some
> > mocked-up examples are at the end of the slides from
> > a talk I recently gave, available at
> > http://hesweb1.med.virginia.edu/biostat/dm
> >
> > I prefer to use the inherent data types rather than
> > making factor labels reside in a separate SQL table.
> 
> No problem (I think) for self-contained applications whose
> data will not need to migrate from one RDBMS to another.
> But I second Tim's suggestion that we should be thinking
> about this problem in general terms, both beyond factors and
> RDBMS.
> 
> >
> > I have only just begun to look at RMySQL documentation.
> > Does RMySQL fetch SET variables as binary integers
> > such that selected choices can be sensed once
> > the bit mapping is known?
> 
> No, currently we (naively) fetched these as character vectors,
> following the practice in Paul DuBois' "MySQL". I'm happy to
> hear any suggestion...

I think a nice solution would be to fetch those as integers
using MySQL's binary bit string representation.  This could
create a variable of class 'mchoice' for multiple choice,
with attribute mlevels containing the vector of character
strings corresponding to the MySQL bit mappings in the
64-bit integer that represents a SET variable.  The only
downside to SET to me is the 64-choice limitation.

Integer representation of ENUM types, which is consistent
with internal storage of factors in S I believe, 
will preserve the ordering of
levels so that a simple conversion to an ordered factor
would suffice.  So dispensing with character vectors for
both ENUM and SET will be a help I think, assuming
the vector of labels defined to MySQL can be referenced
during import.

Thanks,

Frank

> 
> >
> > Frank Harrell
> >
> >
> > Jason Turner wrote:
> > >
> > > Additions to Doug Bates' suggestions
> > >
> > > > ... it could be expressed naturally as two tables and a
> > > > join.  The first table would contain the data as factor levels and the
> > > > second table would give the correspondence between levels and labels.
> > >
> > > I think this is probably the most clear and elegant way to do it.
> > >
> > > > Another possibility is to define a PostgreSQL data type for this.
> > >
> > > See the PosgreSQL documentation for "CREATE TYPE".
> > >
> > > Cheers
> > >
> > > Jason
> > > --
> > > Indigo Industrial Controls Ltd.
> > > 64-21-343-545
> > > jasont at indigoindustrial.co.nz
> > > -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> > > r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> > > Send "info", "help", or "[un]subscribe"
> > > (in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
> > > _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
> >
> > --
> > Frank E Harrell Jr              Prof. of Biostatistics & Statistics
> > Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
> > U. Virginia School of Medicine  http://hesweb1.med.virginia.edu/biostat
> > -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> > r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> > Send "info", "help", or "[un]subscribe"
> > (in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
> > _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
> 
> --
> David A. James
> Statistics Research, Room 2C-253            Phone:  (908) 582-3082
> Bell Labs, Lucent Technologies              Fax:    (908) 582-3340
> Murray Hill, NJ 09794-0636

-- 
Frank E Harrell Jr              Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine  http://hesweb1.med.virginia.edu/biostat
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._



More information about the R-help mailing list