[Rd] joining columns as in a relational database

Douglas Bates bates at stat.wisc.edu
Wed Jun 25 00:07:20 MEST 2003


In our recent workshop on "Multilevel Modeling in R" we discussed
handling data for multilevel modeling.  An classic example of such
data are test scores of students grouped into schools.  We may wish to
model the scores as functions of both student-level covariates and
school-level covariates.

Such data are often organized in a multi-table format with a separate table
for each level of information.  The MathAchieve and MathAchSchool data
frames in the nlme package are examples of such an organization.  The
HLM software requires the data to be organized like this.  To fit a
model in R we need to create a composite table by "joining" the
columns of the student-level and school-level tables, in the
relational database sense of "join".

I have created a function to join the columns from two such frames
according to the values of a key column.  In relational database terms
the key column must be a primary key for the second frame.  I have
called this function 'cjoin', by analogy to cbind.

You can try

data(MathAchieve, package = 'nlme')
data(MathAchSchool, package = 'nlme')
cjoin(MathAchieve, MathAchSchool, "School")
cjoin(MathAchieve, MathAchSchool, "School", which = "Sector")

as examples

Several questions:

 - Am I duplicating existing functionality?

 - Is cjoin a good name for such a function?

 - Would this be useful in base?

-------------- next part --------------
"cjoin" <-
    function(fr1, fr2, cnm1, cnm2 = cnm1, which)
{
    val = as.data.frame(fr1)
    vnms = names(val)
    cnm1 = as.character(cnm1)
    if (length(cnm1) < 1 || any(is.na(match(cnm1, vnms)))) {
        stop("cnm1 must be one or more column names from fr1")
    }
    fr2nm = names(fr2)
    if (!missing(cnm2)) {
        cnm2 = as.character(cnm2)
        if (length(cnm2) != length(cnm1) || any(is.na(match(cnm2, fr2nm)))) {
            stop(paste("cnm2 must be", length(cnm1),
                       "column name(s) from fr2"))
        }
    } else {
        if (any(missed <- is.na(match(cnm2, fr2nm)))) {
            stop(paste("No columns named", paste(cnm2[missed], sep = ", "),
                       "in fr2"))
        }
    }
    if (length(cnm1) == 1) {
        mcol = fr2[[cnm2]]
        if (any(is.na(mcol))) {
            warning(paste("Missing values in column", mcol,
                          "of fr2 cannot be matched"))
            mcol = mcol[is.na(mcol)]
        }
        if (length(mcol) != length(unique(mcol))) {
            stop(paste("column", mcol, "must be a unique key to fr2"))
        }
        mm = match(fr1[[cnm1]], fr2[[cnm2]])
        if (any(is.na(mm))) {
            stop("Values present in column", cnm1,
                 "of fr1 are not present in column", cnm2, "of fr2")
        }
        if (missing(which)) {
            which = fr2nm[!(fr2nm %in% cnm2)]
        } else {
            which = which[!(which %in% cnm2)]
            if (any(missed <- is.na(match(which, fr2nm)))) {
                stop(paste("No columns named", paste(which[missed], sep = ", "),
                           "in fr2"))
            }
        }
        if (any(dups <- !is.na(match(which, vnms)))) {
            warning(paste("Join operation will duplicate column name(s)",
                          paste(which[dups], sep = ", ")))
        }
        return(cbind(val, fr2[mm, which]))
    } else {
        stop("Matches on multiple columns not yet implemented")
    }
}


More information about the R-devel mailing list