[R] Contatenating data frames with partial overlap in variable names

Sun Mar 25 05:39:53 CEST 2007

On 3/24/07, Marc Schwartz <marc_schwartz at comcast.net> wrote:
> On Sat, 2007-03-24 at 22:16 -0400, Daniel Folkinshteyn wrote:
> > on 03/24/2007 10:00 PM Marc Schwartz said the following:
> > > On Sat, 2007-03-24 at 21:47 -0400, Daniel Folkinshteyn wrote:
> > >> Greetings to all.
> > >> I need to concatenate data frames that do not have all the same variable
> > >> names, there is only a partial overlap in the variables. So, for
> > >> example, if i have two data frames, a and b, that look like the following:
> > >>> a
> > >>   a b
> > >> 1 1 4
> > >> 2 2 5
> > >> 3 3 6
> > >> 4 4 7
> > >> 5 5 8
> > >>> b
> > >>   c  a
> > >> 1 1 10
> > >> 2 2 11
> > >> 3 3 12
> > >> 4 4 13
> > >> 5 5 14
> > >>
> > >> i want to concatenate them by row, without any matching, so that the
> > >> variables that are not available in all frames get NAs. The result
> > >> should look like:
> > >>
> > >>    a  b  c
> > >> 1  1  4  NA
> > >> 2  2  5  NA
> > >> 3  3  6  NA
> > >> 4  4  7  NA
> > >> 5  5  8  NA
> > >> 6  10 NA 1
> > >> 7  11 NA 2
> > >> 8  12 NA 3
> > >> 9  13 NA 4
> > >> 10 14 NA 5
> > >>
> > >> rbind doesn't work, since it requires all variables to be matched
> > >> between the two data frames. merge doesn't work, since it wants to
> > >> /match/ by columns with the same name, and if matching by nothing,
> > >> produces a cartesian product.
> > >>
> > >> is there a neat trick for doing this simply, or am i stuck with
> > >> comparing variable lists and generating NAs manually?
> > >>
> > >> would appreciate any help!
> > >> Daniel
> > >
> > > You can use merge():
> > >
> > >> a
> > >   a b
> > > 1 1 4
> > > 2 2 5
> > > 3 3 6
> > > 4 4 7
> > > 5 5 8
> > >
> > >> b
> > >   c  a
> > > 1 1 10
> > > 2 2 11
> > > 3 3 12
> > > 4 4 13
> > > 5 5 14
> > >
> > >
> > > Use 'a' as the common 'by' column and specify 'all = TRUE' so that
> > > non-matching values of 'a' will be included in the result:
> > >
> > >
> > >> merge(a, b, by = "a", all = TRUE)
> > >     a  b  c
> > > 1   1  4 NA
> > > 2   2  5 NA
> > > 3   3  6 NA
> > > 4   4  7 NA
> > > 5   5  8 NA
> > > 6  10 NA  1
> > > 7  11 NA  2
> > > 8  12 NA  3
> > > 9  13 NA  4
> > > 10 14 NA  5
> > >
> > Thanks for your quick response. Unfortunately, this is still not quite
> > what I have in mind (though maybe it's my fault for not making this too
> > clear). Even if the two data frames happen to have some values of 'a'
> > that match, I still want those records to remain separate, rather than
> > merge. So, for instance, using merge will produce the following:
> > > a = data.frame(a=1:5, b=4:8)
> > > a
> >   a b
> > 1 1 4
> > 2 2 5
> > 3 3 6
> > 4 4 7
> > 5 5 8
> > > b = data.frame(c=1:5, a=4:8)
> > > b
> >   c a
> > 1 1 4
> > 2 2 5
> > 3 3 6
> > 4 4 7
> > 5 5 8
> > > merge(a,b,by='a',all=T)
> >   a  b  c
> > 1 1  4 NA
> > 2 2  5 NA
> > 3 3  6 NA
> > 4 4  7  1
> > 5 5  8  2
> > 6 6 NA  3
> > 7 7 NA  4
> > 8 8 NA  5
> >
> > whereas I would still want it to produce 10 separate rows, because they
> > are separate observations, it's just that one of them happens to be
> > missing a variable.
>
> OK. Not sure if this is the most efficient way of doing this, but this
> seems to work, though through very limited testing.
>
> Basically what I am doing is using setdiff() to figure out which columns
> are not common between the two data frames. In each case, I then use
> sapply() to loop over the results, creating a new column of NA's that
> will be cbind()ed back to the original data frame.

That's pretty much what rbind.fill does, with an extra bit of error
checking and generalise for any number of matrics:

> rbind.fill
function (...)
{
    dfs <- list(...)
    if (length(dfs) == 0)
        return(list())
    all.names <- unique(unlist(lapply(dfs, names)))
    do.call("rbind", compact(lapply(dfs, function(df) {
        if (length(df) == 0 || nrow(df) == 0)
            return(NULL)
        missing.vars <- setdiff(all.names, names(df))
        if (length(missing.vars) > 0)
            df[, missing.vars] <- NA
        df
    })))
}
<environment: namespace:reshape>

>
> Once that is done, the two new data frames, a.2 and b.2, will have
> common columns and they can then be rbind()ed.
>
>
> a.2 <- cbind(a, sapply(setdiff(colnames(b), colnames(a)),
>                        function(x) x = rep(NA, nrow(a))))
>
> b.2 <- cbind(b, sapply(setdiff(colnames(a), colnames(b)),
>                        function(x) x = rep(NA, nrow(b))))
>
> > a.2
>   a b  c
> 1 1 4 NA
> 2 2 5 NA
> 3 3 6 NA
> 4 4 7 NA
> 5 5 8 NA
>
> > b.2
>   c a  b
> 1 1 4 NA
> 2 2 5 NA
> 3 3 6 NA
> 4 4 7 NA
> 5 5 8 NA
>
>
> > rbind(a.2, b.2)
>    a  b  c
> 1  1  4 NA
> 2  2  5 NA
> 3  3  6 NA
> 4  4  7 NA
> 5  5  8 NA
> 6  4 NA  1
> 7  5 NA  2
> 8  6 NA  3
> 9  7 NA  4
> 10 8 NA  5
>
>
> Hopefully that will work in more general cases, but I would validate
> that.
>
> HTH,
>
> Marc Schwartz
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>