[Rd] joining columns as in a relational database
Marc Schwartz
MSchwartz at medanalytics.com
Wed Jun 25 01:24:55 MEST 2003
On Tue, 2003-06-24 at 18:07, Douglas Bates wrote:
> In our recent workshop on "Multilevel Modeling in R" we discussed
> handling data for multilevel modeling. An classic example of such
> data are test scores of students grouped into schools. We may wish to
> model the scores as functions of both student-level covariates and
> school-level covariates.
>
> Such data are often organized in a multi-table format with a separate table
> for each level of information. The MathAchieve and MathAchSchool data
> frames in the nlme package are examples of such an organization. The
> HLM software requires the data to be organized like this. To fit a
> model in R we need to create a composite table by "joining" the
> columns of the student-level and school-level tables, in the
> relational database sense of "join".
>
> I have created a function to join the columns from two such frames
> according to the values of a key column. In relational database terms
> the key column must be a primary key for the second frame. I have
> called this function 'cjoin', by analogy to cbind.
>
> You can try
>
> data(MathAchieve, package = 'nlme')
> data(MathAchSchool, package = 'nlme')
> cjoin(MathAchieve, MathAchSchool, "School")
> cjoin(MathAchieve, MathAchSchool, "School", which = "Sector")
>
> as examples
>
> Several questions:
>
> - Am I duplicating existing functionality?
>
> - Is cjoin a good name for such a function?
>
> - Would this be useful in base?
Prof. Bates,
Perhaps I am not seeing all of the details but a quick (perhaps too
quick) review would suggest that merge(), which is in base, would at
least be a parallel function.
For example:
data(MathAchieve, package = 'nlme')
data(MathAchSchool, package = 'nlme')
cj1 <- cjoin(MathAchieve, MathAchSchool, "School")
mrg1 <- merge(MathAchieve, MathAchSchool, by = "School")
dim(cj1)
[1] 7185 12
dim(mrg1)
[1] 7185 12
colnames(cj1)
[1] "School" "Minority" "Sex" "SES" "MathAch" "MEANSES"
[7] "Size" "Sector" "PRACAD" "DISCLIM" "HIMINTY" "MEANSES"
colnames(mrg1)
[1] "School" "Minority" "Sex" "SES" "MathAch"
"MEANSES.x"
[7] "Size" "Sector" "PRACAD" "DISCLIM" "HIMINTY"
"MEANSES.y"
> table(cj1 == mrg1)
TRUE
86220
Note that cj1 appears to have duplicate colnames for "MEANSES", for
which your function does issue a warning message, whereas mrg1 has
appended ".x" and ".y" as default suffixes.
In the second case, if I am reading the code correctly, the 'which'
argument appears to restrict the columns retained from the second data
frame after the key match, in this case "Sector".
> cj2 <- cjoin(MathAchieve, MathAchSchool, "School", which = "Sector")
> mrg2 <- mrg1[, c(1:6, 8)]
> dim(cj2)
[1] 7185 7
> dim(mrg2)
[1] 7185 7
> colnames(cj2)
[1] "School" "Minority" "Sex" "SES"
[5] "MathAch" "MEANSES" "fr2[mm, which]"
> colnames(mrg2)
[1] "School" "Minority" "Sex" "SES" "MathAch"
"MEANSES.x"
[7] "Sector"
> table(cj2 == mrg2)
TRUE
50295
See ?merge for more information.
Thoughts?
Best regards,
Marc Schwartz
More information about the R-devel
mailing list