[R] Contatenating data frames with partial overlap in variable names
hadley wickham
h.wickham at gmail.com
Sun Mar 25 05:39:53 CEST 2007
On 3/24/07, Marc Schwartz <marc_schwartz at comcast.net> wrote:
> On Sat, 2007-03-24 at 22:16 -0400, Daniel Folkinshteyn wrote:
> > on 03/24/2007 10:00 PM Marc Schwartz said the following:
> > > On Sat, 2007-03-24 at 21:47 -0400, Daniel Folkinshteyn wrote:
> > >> Greetings to all.
> > >> I need to concatenate data frames that do not have all the same variable
> > >> names, there is only a partial overlap in the variables. So, for
> > >> example, if i have two data frames, a and b, that look like the following:
> > >>> a
> > >> a b
> > >> 1 1 4
> > >> 2 2 5
> > >> 3 3 6
> > >> 4 4 7
> > >> 5 5 8
> > >>> b
> > >> c a
> > >> 1 1 10
> > >> 2 2 11
> > >> 3 3 12
> > >> 4 4 13
> > >> 5 5 14
> > >>
> > >> i want to concatenate them by row, without any matching, so that the
> > >> variables that are not available in all frames get NAs. The result
> > >> should look like:
> > >>
> > >> a b c
> > >> 1 1 4 NA
> > >> 2 2 5 NA
> > >> 3 3 6 NA
> > >> 4 4 7 NA
> > >> 5 5 8 NA
> > >> 6 10 NA 1
> > >> 7 11 NA 2
> > >> 8 12 NA 3
> > >> 9 13 NA 4
> > >> 10 14 NA 5
> > >>
> > >> rbind doesn't work, since it requires all variables to be matched
> > >> between the two data frames. merge doesn't work, since it wants to
> > >> /match/ by columns with the same name, and if matching by nothing,
> > >> produces a cartesian product.
> > >>
> > >> is there a neat trick for doing this simply, or am i stuck with
> > >> comparing variable lists and generating NAs manually?
> > >>
> > >> would appreciate any help!
> > >> Daniel
> > >
> > > You can use merge():
> > >
> > >> a
> > > a b
> > > 1 1 4
> > > 2 2 5
> > > 3 3 6
> > > 4 4 7
> > > 5 5 8
> > >
> > >> b
> > > c a
> > > 1 1 10
> > > 2 2 11
> > > 3 3 12
> > > 4 4 13
> > > 5 5 14
> > >
> > >
> > > Use 'a' as the common 'by' column and specify 'all = TRUE' so that
> > > non-matching values of 'a' will be included in the result:
> > >
> > >
> > >> merge(a, b, by = "a", all = TRUE)
> > > a b c
> > > 1 1 4 NA
> > > 2 2 5 NA
> > > 3 3 6 NA
> > > 4 4 7 NA
> > > 5 5 8 NA
> > > 6 10 NA 1
> > > 7 11 NA 2
> > > 8 12 NA 3
> > > 9 13 NA 4
> > > 10 14 NA 5
> > >
> > Thanks for your quick response. Unfortunately, this is still not quite
> > what I have in mind (though maybe it's my fault for not making this too
> > clear). Even if the two data frames happen to have some values of 'a'
> > that match, I still want those records to remain separate, rather than
> > merge. So, for instance, using merge will produce the following:
> > > a = data.frame(a=1:5, b=4:8)
> > > a
> > a b
> > 1 1 4
> > 2 2 5
> > 3 3 6
> > 4 4 7
> > 5 5 8
> > > b = data.frame(c=1:5, a=4:8)
> > > b
> > c a
> > 1 1 4
> > 2 2 5
> > 3 3 6
> > 4 4 7
> > 5 5 8
> > > merge(a,b,by='a',all=T)
> > a b c
> > 1 1 4 NA
> > 2 2 5 NA
> > 3 3 6 NA
> > 4 4 7 1
> > 5 5 8 2
> > 6 6 NA 3
> > 7 7 NA 4
> > 8 8 NA 5
> >
> > whereas I would still want it to produce 10 separate rows, because they
> > are separate observations, it's just that one of them happens to be
> > missing a variable.
>
> OK. Not sure if this is the most efficient way of doing this, but this
> seems to work, though through very limited testing.
>
> Basically what I am doing is using setdiff() to figure out which columns
> are not common between the two data frames. In each case, I then use
> sapply() to loop over the results, creating a new column of NA's that
> will be cbind()ed back to the original data frame.
That's pretty much what rbind.fill does, with an extra bit of error
checking and generalise for any number of matrics:
> rbind.fill
function (...)
{
dfs <- list(...)
if (length(dfs) == 0)
return(list())
all.names <- unique(unlist(lapply(dfs, names)))
do.call("rbind", compact(lapply(dfs, function(df) {
if (length(df) == 0 || nrow(df) == 0)
return(NULL)
missing.vars <- setdiff(all.names, names(df))
if (length(missing.vars) > 0)
df[, missing.vars] <- NA
df
})))
}
<environment: namespace:reshape>
>
> Once that is done, the two new data frames, a.2 and b.2, will have
> common columns and they can then be rbind()ed.
>
>
> a.2 <- cbind(a, sapply(setdiff(colnames(b), colnames(a)),
> function(x) x = rep(NA, nrow(a))))
>
> b.2 <- cbind(b, sapply(setdiff(colnames(a), colnames(b)),
> function(x) x = rep(NA, nrow(b))))
>
> > a.2
> a b c
> 1 1 4 NA
> 2 2 5 NA
> 3 3 6 NA
> 4 4 7 NA
> 5 5 8 NA
>
> > b.2
> c a b
> 1 1 4 NA
> 2 2 5 NA
> 3 3 6 NA
> 4 4 7 NA
> 5 5 8 NA
>
>
> > rbind(a.2, b.2)
> a b c
> 1 1 4 NA
> 2 2 5 NA
> 3 3 6 NA
> 4 4 7 NA
> 5 5 8 NA
> 6 4 NA 1
> 7 5 NA 2
> 8 6 NA 3
> 9 7 NA 4
> 10 8 NA 5
>
>
> Hopefully that will work in more general cases, but I would validate
> that.
>
> HTH,
>
> Marc Schwartz
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list