[R] Contatenating data frames with partial overlap in variable names

Sun Mar 25 05:13:44 CEST 2007

On Sat, 2007-03-24 at 22:16 -0400, Daniel Folkinshteyn wrote:
> on 03/24/2007 10:00 PM Marc Schwartz said the following:
> > On Sat, 2007-03-24 at 21:47 -0400, Daniel Folkinshteyn wrote:
> >> Greetings to all.
> >> I need to concatenate data frames that do not have all the same variable
> >> names, there is only a partial overlap in the variables. So, for
> >> example, if i have two data frames, a and b, that look like the following:
> >>> a
> >>   a b
> >> 1 1 4
> >> 2 2 5
> >> 3 3 6
> >> 4 4 7
> >> 5 5 8
> >>> b
> >>   c  a
> >> 1 1 10
> >> 2 2 11
> >> 3 3 12
> >> 4 4 13
> >> 5 5 14
> >>
> >> i want to concatenate them by row, without any matching, so that the
> >> variables that are not available in all frames get NAs. The result
> >> should look like:
> >>
> >>    a  b  c
> >> 1  1  4  NA
> >> 2  2  5  NA
> >> 3  3  6  NA
> >> 4  4  7  NA
> >> 5  5  8  NA
> >> 6  10 NA 1
> >> 7  11 NA 2
> >> 8  12 NA 3
> >> 9  13 NA 4
> >> 10 14 NA 5
> >>
> >> rbind doesn't work, since it requires all variables to be matched
> >> between the two data frames. merge doesn't work, since it wants to
> >> /match/ by columns with the same name, and if matching by nothing,
> >> produces a cartesian product.
> >>
> >> is there a neat trick for doing this simply, or am i stuck with
> >> comparing variable lists and generating NAs manually?
> >>
> >> would appreciate any help!
> >> Daniel
> > 
> > You can use merge():
> > 
> >> a
> >   a b
> > 1 1 4
> > 2 2 5
> > 3 3 6
> > 4 4 7
> > 5 5 8
> > 
> >> b
> >   c  a
> > 1 1 10
> > 2 2 11
> > 3 3 12
> > 4 4 13
> > 5 5 14
> > 
> > 
> > Use 'a' as the common 'by' column and specify 'all = TRUE' so that
> > non-matching values of 'a' will be included in the result:
> > 
> > 
> >> merge(a, b, by = "a", all = TRUE)
> >     a  b  c
> > 1   1  4 NA
> > 2   2  5 NA
> > 3   3  6 NA
> > 4   4  7 NA
> > 5   5  8 NA
> > 6  10 NA  1
> > 7  11 NA  2
> > 8  12 NA  3
> > 9  13 NA  4
> > 10 14 NA  5
> > 
> Thanks for your quick response. Unfortunately, this is still not quite
> what I have in mind (though maybe it's my fault for not making this too
> clear). Even if the two data frames happen to have some values of 'a'
> that match, I still want those records to remain separate, rather than
> merge. So, for instance, using merge will produce the following:
> > a = data.frame(a=1:5, b=4:8)
> > a
>   a b
> 1 1 4
> 2 2 5
> 3 3 6
> 4 4 7
> 5 5 8
> > b = data.frame(c=1:5, a=4:8)
> > b
>   c a
> 1 1 4
> 2 2 5
> 3 3 6
> 4 4 7
> 5 5 8
> > merge(a,b,by='a',all=T)
>   a  b  c
> 1 1  4 NA
> 2 2  5 NA
> 3 3  6 NA
> 4 4  7  1
> 5 5  8  2
> 6 6 NA  3
> 7 7 NA  4
> 8 8 NA  5
> 
> whereas I would still want it to produce 10 separate rows, because they
> are separate observations, it's just that one of them happens to be
> missing a variable.

OK. Not sure if this is the most efficient way of doing this, but this
seems to work, though through very limited testing.

Basically what I am doing is using setdiff() to figure out which columns
are not common between the two data frames. In each case, I then use
sapply() to loop over the results, creating a new column of NA's that
will be cbind()ed back to the original data frame.

Once that is done, the two new data frames, a.2 and b.2, will have
common columns and they can then be rbind()ed.

a.2 <- cbind(a, sapply(setdiff(colnames(b), colnames(a)), 
                       function(x) x = rep(NA, nrow(a))))

b.2 <- cbind(b, sapply(setdiff(colnames(a), colnames(b)), 
                       function(x) x = rep(NA, nrow(b))))

> a.2
  a b  c
1 1 4 NA
2 2 5 NA
3 3 6 NA
4 4 7 NA
5 5 8 NA

> b.2
  c a  b
1 1 4 NA
2 2 5 NA
3 3 6 NA
4 4 7 NA
5 5 8 NA

> rbind(a.2, b.2)
   a  b  c
1  1  4 NA
2  2  5 NA
3  3  6 NA
4  4  7 NA
5  5  8 NA
6  4 NA  1
7  5 NA  2
8  6 NA  3
9  7 NA  4
10 8 NA  5

Hopefully that will work in more general cases, but I would validate
that.

HTH,

Marc Schwartz