[R-sig-DB] RMySQL and factors
Jim McLoughlin
j|mmc|ough||n @end|ng |rom e@rth||nk@net
Fri Feb 4 00:08:55 CET 2005
>> This has the potential problem that if the data changes or the sql
>> statement has
>> a WHERE in it, each run might get slightly different levels().
>
>
> I don't think that presents a problem. Either you know from the
> experimental design what all of the possible levels are, or you don't.
>
> If you don't, there is no way to make sure that all will be present
> every time the function is used., no matter whether a 'where' clause
> is added or not, or whether or not the table changes.
>
> If you do know what the possible levels are, you can explicitly list
> them in the call to factor(). See ?factor.
>
> If at any given point in time the database table is considered to have
> all the possible levels present, and you want to have all of them
> present in the factor variable in R, even when not all levels are
> retrieved from the database, it's still easy:
To expand on the above, one option is to make the database design
accurately reflect the experiment:
Have two tables instead of one:
Patient (int patient_id, char* name, int race_id)
Race(int race_id, char* race_name)
Race contains all possible factor levels, and might be useful elsewhere
in the experiment/project. Patient reflects all observed patient
instances, which may or may not include all race levels. You can
modify the query to something like:
getAllRaces <- function(connection) {
sql <- "SELECT race_name FROM RACE"
rs <- dbGetQuery(connection, sql)
allRaces <- rs$race_name
}
getRaceAsFactors <- function(connection) {
sql <- "SELECT t2.race_name FROM PATIENT t1, RACE t2 where t1.race_id
= t2.race_id"
rs <- dbGetQuery(connection, sql)
allRaces <- getAllRaces(connection)
raceFactors <- factor(rs$race_name, levels=allRaces)
}
of course, making a second database table may be overkill depending on
the size of your project, and how often all race levels are needed in
your code.
Jim
More information about the R-sig-DB
mailing list