[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y
Martin Maechler
maechler at stat.math.ethz.ch
Thu Feb 22 12:31:40 CET 2018
>>>>> Gabriel Becker <gmbecker at ucdavis.edu>
>>>>> on Wed, 21 Feb 2018 07:11:44 -0800 writes:
> Hi all,
> For the record this approach isnt 100% backwards compatible, because
> names(mergeddf) will e incompatibly different. Thatx why i claimed
> bakcwards compatable-ish
exactly.
> That said its still worth considering imho because of the reasons stated
> (and honestly one particular simple reading of the docs might suggest that
> this was thr intended behavior all along). Im not a member of Rcore through
> so i cant do said considering myself.
I agree with Scott, Frederik and you that this changes seems
worth considering.
As Duncan Murdoch has mentioned, this alone may not be
sufficient.
In addition to your proposed patch (which I have simplified, not
using intersection() but working with underlying match()
directly), it is little work to introduce an extra argument, I'm
calling 'no.dups = TRUE' which when set to false would mirror
current R's behavior... and documenting it, then also documents the
new behavior (to some extent).
My plan is to commit it soonish ;-)
Martin
> Best,
> ~G
> On Feb 20, 2018 7:15 PM, <frederik at ofb.net> wrote:
> Hi Scott,
> I tried the new patch and can confirm that it has the advertised
> behavior on a couple of test cases. I think it makes sense to apply
> it, because any existing code which refers to a second duplicate
> data.frame column by name is already broken, while if the reference is
> by numerical index then changing the column name shouldn't break it.
> I don't know if you need to update the documentation as part of your
> patch, or if whoever applies it would be happy to do that. Somebody
> from R core want to weigh in on this?
> I attach a file with the test example from your original email as well
> as a second test case I added with two "by" columns.
> Thanks,
> Frederick
> On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote:
>> Hi Frederick,
>>
>> It looks like I didn't overwrite the patch.diff file after the last edits.
>> Here's the correct patch (attached and copied below):
>>
>> Index: src/library/base/R/merge.R
>> ===================================================================
>> --- src/library/base/R/merge.R (revision 74280)
>> +++ src/library/base/R/merge.R (working copy)
>> @@ -157,6 +157,14 @@
>> }
>>
>> if(has.common.nms) names(y) <- nm.y
>> + ## If by.x %in% names(y) then duplicate column names still arise,
>> + ## apply suffixes to just y - this keeps backwards compatibility
>> + ## when referring to by.x in the resulting data.frame
>> + dupe.keyx <- intersect(nm.by, names(y))
>> + if(length(dupe.keyx)) {
>> + if(nzchar(suffixes[2L]))
>> + names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx,
>> suffixes[2L], sep="")
>> + }
>> nm <- c(names(x), names(y))
>> if(any(d <- duplicated(nm)))
>> if(sum(d) > 1L)
>>
>> Best,
>>
>> Scott
>>
>> On 21 February 2018 at 08:23, <frederik at ofb.net> wrote:
>>
>> > Hi Scott,
>> >
>> > I think that's a good idea and I tried your patch on my copy of the
>> > repository. But it looks to me like the recent patch is identical to
>> > the previous one, can you confirm this?
>> >
>> > Frederick
>> >
>> > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
>> > > Thanks Gabriel,
>> > >
>> > > I think your suggested approach is 100% backwards compatible
>> > >
>> > > Currently in the case of duplicate column names only the first can be
>> > > indexed by its name. This will always be the column appearing in by.x,
>> > > meaning the column in y with the same name cannot be accessed.
> Appending
>> > > ".y" (suffixes[2L]) to this column means it can now be accessed, while
>> > > keeping the current behaviour of making the key columns always
> accessible
>> > > by using the names provided to by.x.
>> > >
>> > > I've attached a new patch that has this behaviour.
>> > >
>> > > Best,
>> > >
>> > > Scott
>> > >
>> > > On 19 February 2018 at 05:08, Gabriel Becker <gmbecker at ucdavis.edu>
>> > wrote:
>> > >
>> > > > It seems like there is a way that is backwards compatible-ish in the
>> > sense
>> > > > mentioned and still has the (arguably, but a good argument I think)
>> > better
>> > > > behavior:
>> > > >
>> > > > if by.x is 'name', (AND by.y is not also 'name'), then x's 'name'
>> > column
>> > > > is called name and y's 'name' column (not used int he merge) is
>> > changed to
>> > > > name.y.
>> > > >
>> > > > Now of course this would still change output, but it would change
> it to
>> > > > something I think would be better, while retaining the 'merge
> columns
>> > > > retain their exact names' mechanic as documented.
>> > > >
>> > > > ~G
>> > > >
>> > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <
> s.ritchie73 at gmail.com>
>> > > > wrote:
>> > > >
>> > > >> Thanks Duncan and Frederick,
>> > > >>
>> > > >> I suspected as much - there doesn't appear to be any reason why
>> > conflicts
>> > > >> between by.x and names(y) shouldn't and cannot be checked, but I
> can
>> > see
>> > > >> how this might be more trouble than its worth given it potentially
> may
>> > > >> break downstream packages (i.e. any cases where this occurs but
> they
>> > > >> expect
>> > > >> the name of the key column(s) to remain the same).
>> > > >>
>> > > >> Best,
>> > > >>
>> > > >> Scott
>> > > >>
>> > > >> On 18 February 2018 at 11:48, Duncan Murdoch <
>> > murdoch.duncan at gmail.com>
>> > > >> wrote:
>> > > >>
>> > > >> > On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
>> > > >> >
>> > > >> >> Hi Scott,
>> > > >> >>
>> > > >> >> Thanks for the patch. I'm not really involved in R development;
> it
>> > > >> >> will be up to someone in the R core team to apply it. I would
>> > hazard
>> > > >> >> to say that even if correct (I haven't checked), it will not be
>> > > >> >> applied because the change might break existing code. For
> example
>> > it
>> > > >> >> seems like reasonable code might easily assume that a column
> with
>> > the
>> > > >> >> same name as "by.x" exists in the output of 'merge'. That's
> just my
>> > > >> >> best guess... I don't participate on here often.
>> > > >> >>
>> > > >> >
>> > > >> >
>> > > >> > I think you're right. If I were still a member of R Core, I
> would
>> > want
>> > > >> to
>> > > >> > test this against all packages on CRAN and Bioconductor, and
> since
>> > that
>> > > >> > test takes a couple of days to run on my laptop, I'd probably
> never
>> > get
>> > > >> > around to it.
>> > > >> >
>> > > >> > There are lots of cases where "I would have done that
> differently",
>> > but
>> > > >> > most of them are far too much trouble to change now that R is
> more
>> > than
>> > > >> 20
>> > > >> > years old. And in many cases it will turn out that the way R
> does
>> > it
>> > > >> > actually does make more sense than the way I would have done it.
>> > > >> >
>> > > >> > Duncan Murdoch
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> >> Cheers,
>> > > >> >>
>> > > >> >> Frederick
>> > > >> >>
>> > > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
>> > > >> >>
>> > > >> >>> The attached patch.diff will make merge.data.frame() append the
>> > > >> suffixes
>> > > >> >>> to
>> > > >> >>> columns with common names between by.x and names(y).
>> > > >> >>>
>> > > >> >>> Best,
>> > > >> >>>
>> > > >> >>> Scott Ritchie
>> > > >> >>>
>> > > >> >>> On 17 February 2018 at 11:15, Scott Ritchie <
>> > s.ritchie73 at gmail.com>
>> > > >> >>> wrote:
>> > > >> >>>
>> > > >> >>> Hi Frederick,
>> > > >> >>>>
>> > > >> >>>> I would expect that any duplicate names in the resulting
>> > data.frame
>> > > >> >>>> would
>> > > >> >>>> have the suffixes appended to them, regardless of whether or
> not
>> > they
>> > > >> >>>> are
>> > > >> >>>> used as the join key. So in my example I would expect
> "names.x"
>> > and
>> > > >> >>>> "names.y" to indicate their source data.frame.
>> > > >> >>>>
>> > > >> >>>> While careful reading of the documentation reveals this is not
>> > the
>> > > >> >>>> case, I
>> > > >> >>>> would argue the intent of the suffixes functionality should
>> > equally
>> > > >> be
>> > > >> >>>> applied to this type of case.
>> > > >> >>>>
>> > > >> >>>> If you agree this would be useful, I'm happy to write a patch
> for
>> > > >> >>>> merge.data.frame that will add suffixes in this case - I
> intend
>> > to do
>> > > >> >>>> the
>> > > >> >>>> same for merge.data.table in the data.table package where I
>> > initially
>> > > >> >>>> encountered the edge case.
>> > > >> >>>>
>> > > >> >>>> Best,
>> > > >> >>>>
>> > > >> >>>> Scott
>> > > >> >>>>
>> > > >> >>>> On 17 February 2018 at 03:53, <frederik at ofb.net> wrote:
>> > > >> >>>>
>> > > >> >>>> Hi Scott,
>> > > >> >>>>>
>> > > >> >>>>> It seems like reasonable behavior to me. What result would
> you
>> > > >> expect?
>> > > >> >>>>> That the second "name" should be called "name.y"?
>> > > >> >>>>>
>> > > >> >>>>> The "merge" documentation says:
>> > > >> >>>>>
>> > > >> >>>>> If the columns in the data frames not used in merging
> have
>> > any
>> > > >> >>>>> common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’
> by
>> > > >> default)
>> > > >> >>>>> appended to try to make the names of the result unique.
>> > > >> >>>>>
>> > > >> >>>>> Since the first "name" column was used in merging, leaving
> both
>> > > >> >>>>> without a suffix seems consistent with the documentation...
>> > > >> >>>>>
>> > > >> >>>>> Frederick
>> > > >> >>>>>
>> > > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie
> wrote:
>> > > >> >>>>>
>> > > >> >>>>>> Hi,
>> > > >> >>>>>>
>> > > >> >>>>>> I was unable to find a bug report for this with a cursory
>> > search,
>> > > >> but
>> > > >> >>>>>>
>> > > >> >>>>> would
>> > > >> >>>>>
>> > > >> >>>>>> like clarification if this is intended or unavoidable
>> > behaviour:
>> > > >> >>>>>>
>> > > >> >>>>>> ```{r}
>> > > >> >>>>>> # Create example data.frames
>> > > >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
>> > > >> >>>>>> sex=c("F", "M", "F", "M"),
>> > > >> >>>>>> age=c(41, 43, 36, 51))
>> > > >> >>>>>> children <- data.frame(parent=c("Sarah", "Max", "Qin"),
>> > > >> >>>>>> name=c("Oliver", "Sebastian",
>> > "Kai-lee"),
>> > > >> >>>>>> sex=c("M", "M", "F"),
>> > > >> >>>>>> age=c(5,8,7))
>> > > >> >>>>>>
>> > > >> >>>>>> # Merge() creates a duplicated "name" column:
>> > > >> >>>>>> merge(parents, children, by.x = "name", by.y = "parent")
>> > > >> >>>>>> ```
>> > > >> >>>>>>
>> > > >> >>>>>> Output:
>> > > >> >>>>>> ```
>> > > >> >>>>>> name sex.x age.x name sex.y age.y
>> > > >> >>>>>> 1 Max M 43 Sebastian M 8
>> > > >> >>>>>> 2 Qin F 36 Kai-lee F 7
>> > > >> >>>>>> 3 Sarah F 41 Oliver M 5
>> > > >> >>>>>> Warning message:
>> > > >> >>>>>> In merge.data.frame(parents, children, by.x = "name", by.y =
>> > > >> >>>>>> "parent") :
>> > > >> >>>>>> column name ‘name’ is duplicated in the result
>> > > >> >>>>>> ```
>> > > >> >>>>>>
>> > > >> >>>>>> Kind Regards,
>> > > >> >>>>>>
>> > > >> >>>>>> Scott Ritchie
>> > > >> >>>>>>
>> > > >> >>>>>> [[alternative HTML version deleted]]
>> > > >> >>>>>>
>> > > >> >>>>>> ______________________________________________
>> > > >> >>>>>> R-devel at r-project.org mailing list
>> > > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> > > >> >>>>>>
>> > > >> >>>>>>
>> > > >> >>>>>
>> > > >> >>>>
>> > > >> >>>>
>> > > >> >> Index: src/library/base/R/merge.R
>> > > >> >>> ============================================================
>> > =======
>> > > >> >>> --- src/library/base/R/merge.R (revision 74264)
>> > > >> >>> +++ src/library/base/R/merge.R (working copy)
>> > > >> >>> @@ -157,6 +157,15 @@
>> > > >> >>> }
>> > > >> >>> if(has.common.nms) names(y) <- nm.y
>> > > >> >>> + ## If by.x %in% names(y) then duplicate column names
>> > still
>> > > >> >>> arise,
>> > > >> >>> + ## apply suffixes to these
>> > > >> >>> + dupe.keyx <- intersect(nm.by, names(y))
>> > > >> >>> + if(length(dupe.keyx)) {
>> > > >> >>> + if(nzchar(suffixes[1L]))
>> > > >> >>> + names(x)[match(dupe.keyx, names(x), 0L)] <-
>> > > >> >>> paste(dupe.keyx, suffixes[1L], sep="")
>> > > >> >>> + if(nzchar(suffixes[2L]))
>> > > >> >>> + names(y)[match(dupe.keyx, names(y), 0L)] <-
>> > > >> >>> paste(dupe.keyx, suffixes[2L], sep="")
>> > > >> >>> + }
>> > > >> >>> nm <- c(names(x), names(y))
>> > > >> >>> if(any(d <- duplicated(nm)))
>> > > >> >>> if(sum(d) > 1L)
>> > > >> >>>
>> > > >> >>
>> > > >> >> ______________________________________________
>> > > >> >> R-devel at r-project.org mailing list
>> > > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>> > > >> >>
>> > > >> >>
>> > > >> >
>> > > >>
>> > > >> [[alternative HTML version deleted]]
>> > > >>
>> > > >> ______________________________________________
>> > > >> R-devel at r-project.org mailing list
>> > > >> https://stat.ethz.ch/mailman/listinfo/r-devel
>> > > >>
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Gabriel Becker, PhD
>> > > > Scientist (Bioinformatics)
>> > > > Genentech Research
>> > > >
>> >
>> > > Index: src/library/base/R/merge.R
>> > > ===================================================================
>> > > --- src/library/base/R/merge.R (revision 74264)
>> > > +++ src/library/base/R/merge.R (working copy)
>> > > @@ -157,6 +157,15 @@
>> > > }
>> > >
>> > > if(has.common.nms) names(y) <- nm.y
>> > > + ## If by.x %in% names(y) then duplicate column names still
>> > arise,
>> > > + ## apply suffixes to these
>> > > + dupe.keyx <- intersect(nm.by, names(y))
>> > > + if(length(dupe.keyx)) {
>> > > + if(nzchar(suffixes[1L]))
>> > > + names(x)[match(dupe.keyx, names(x), 0L)] <-
>> > paste(dupe.keyx, suffixes[1L], sep="")
>> > > + if(nzchar(suffixes[2L]))
>> > > + names(y)[match(dupe.keyx, names(y), 0L)] <-
>> > paste(dupe.keyx, suffixes[2L], sep="")
>> > > + }
>> > > nm <- c(names(x), names(y))
>> > > if(any(d <- duplicated(nm)))
>> > > if(sum(d) > 1L)
>> >
>> > > ______________________________________________
>> > > R-devel at r-project.org mailing list
>> > > https://stat.ethz.ch/mailman/listinfo/r-devel
>> >
>> >
>> Index: src/library/base/R/merge.R
>> ===================================================================
>> --- src/library/base/R/merge.R (revision 74280)
>> +++ src/library/base/R/merge.R (working copy)
>> @@ -157,6 +157,14 @@
>> }
>>
>> if(has.common.nms) names(y) <- nm.y
>> + ## If by.x %in% names(y) then duplicate column names still arise,
>> + ## apply suffixes to just y - this keeps backwards compatibility
>> + ## when referring to by.x in the resulting data.frame
>> + dupe.keyx <- intersect(nm.by, names(y))
>> + if(length(dupe.keyx)) {
>> + if(nzchar(suffixes[2L]))
>> + names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx,
> suffixes[2L], sep="")
>> + }
>> nm <- c(names(x), names(y))
>> if(any(d <- duplicated(nm)))
>> if(sum(d) > 1L)
> [[alternative HTML version deleted]]
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
More information about the R-devel
mailing list