[R-sig-DB] RMySQL and factors

Fri Feb 4 00:08:55 CET 2005

>> This has the potential problem that if the data changes or the sql 
>> statement has
>> a WHERE in it, each run might get slightly different levels().
>
>
> I don't think that presents a problem. Either you know from the 
> experimental design what all of the possible levels are, or you don't.
>
> If you don't, there is no way to make sure that all will be present 
> every time the function is used., no matter whether a 'where' clause 
> is added or not, or whether or not the table changes.
>
> If you do know what the possible levels are, you can explicitly list 
> them in the call to factor(). See ?factor.
>
> If at any given point in time the database table is considered to have 
> all the possible levels present, and you want to have all of them 
> present in the factor variable in R, even when not all levels are 
> retrieved from the database, it's still easy:

To expand on the above, one option is to make the database design 
accurately reflect the experiment:

Have two tables instead of one:
Patient (int patient_id, char* name, int race_id)
Race(int race_id, char* race_name)

Race contains all possible factor levels, and might be useful elsewhere 
in the experiment/project.  Patient reflects all observed patient 
instances, which may or may not include all race levels.  You can 
modify the query to something like:

getAllRaces <- function(connection) {
	sql <- "SELECT race_name FROM RACE"
	rs <- dbGetQuery(connection, sql)
	allRaces <- rs$race_name
}

getRaceAsFactors <- function(connection) {
	sql <- "SELECT t2.race_name FROM PATIENT t1, RACE t2 where t1.race_id 
= t2.race_id"
	rs <- dbGetQuery(connection, sql)
	allRaces <- getAllRaces(connection)
	raceFactors <- factor(rs$race_name, levels=allRaces)
  }

of course, making a second database table may be overkill depending on 
the size of your project, and how often all race levels are needed in 
your code.

Jim