[Rd] joining columns as in a relational database

Wed Jun 25 01:24:55 MEST 2003

On Tue, 2003-06-24 at 18:07, Douglas Bates wrote:
> In our recent workshop on "Multilevel Modeling in R" we discussed
> handling data for multilevel modeling.  An classic example of such
> data are test scores of students grouped into schools.  We may wish to
> model the scores as functions of both student-level covariates and
> school-level covariates.
> 
> Such data are often organized in a multi-table format with a separate table
> for each level of information.  The MathAchieve and MathAchSchool data
> frames in the nlme package are examples of such an organization.  The
> HLM software requires the data to be organized like this.  To fit a
> model in R we need to create a composite table by "joining" the
> columns of the student-level and school-level tables, in the
> relational database sense of "join".
> 
> I have created a function to join the columns from two such frames
> according to the values of a key column.  In relational database terms
> the key column must be a primary key for the second frame.  I have
> called this function 'cjoin', by analogy to cbind.
> 
> You can try
> 
> data(MathAchieve, package = 'nlme')
> data(MathAchSchool, package = 'nlme')
> cjoin(MathAchieve, MathAchSchool, "School")
> cjoin(MathAchieve, MathAchSchool, "School", which = "Sector")
> 
> as examples
> 
> Several questions:
> 
>  - Am I duplicating existing functionality?
> 
>  - Is cjoin a good name for such a function?
> 
>  - Would this be useful in base?

Prof. Bates,

Perhaps I am not seeing all of the details but a quick (perhaps too
quick) review would suggest that merge(), which is in base, would at
least be a parallel function.

For example:

data(MathAchieve, package = 'nlme')
data(MathAchSchool, package = 'nlme')

cj1 <- cjoin(MathAchieve, MathAchSchool, "School")
mrg1 <- merge(MathAchieve, MathAchSchool, by = "School")

dim(cj1)
[1] 7185   12

dim(mrg1)
[1] 7185   12

colnames(cj1)
 [1] "School"   "Minority" "Sex"      "SES"      "MathAch"  "MEANSES"
 [7] "Size"     "Sector"   "PRACAD"   "DISCLIM"  "HIMINTY"  "MEANSES"

colnames(mrg1)
 [1] "School"    "Minority"  "Sex"       "SES"       "MathAch"  
"MEANSES.x"
 [7] "Size"      "Sector"    "PRACAD"    "DISCLIM"   "HIMINTY"  
"MEANSES.y"

> table(cj1 == mrg1)

 TRUE
86220

Note that cj1 appears to have duplicate colnames for "MEANSES", for
which your function does issue a warning message, whereas mrg1 has
appended ".x" and ".y" as default suffixes.

In the second case, if I am reading the code correctly, the 'which'
argument appears to restrict the columns retained from the second data
frame after the key match, in this case "Sector".

> cj2 <- cjoin(MathAchieve, MathAchSchool, "School", which = "Sector")
> mrg2 <- mrg1[, c(1:6, 8)]

> dim(cj2)
[1] 7185    7

> dim(mrg2)
[1] 7185    7

> colnames(cj2)
[1] "School"         "Minority"       "Sex"            "SES"
[5] "MathAch"        "MEANSES"        "fr2[mm, which]"

> colnames(mrg2)
[1] "School"    "Minority"  "Sex"       "SES"       "MathAch"  
"MEANSES.x"
[7] "Sector"

> table(cj2 == mrg2)

 TRUE
50295

See ?merge for more information.

Thoughts?

Best regards,

Marc Schwartz