[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y
Martin Maechler
maechler at stat.math.ethz.ch
Fri Feb 23 15:03:09 CET 2018
>>>>> Scott Ritchie <s.ritchie73 at gmail.com>
>>>>> on Fri, 23 Feb 2018 12:32:41 +1100 writes:
> Thanks Martin!
> Can you clarify the functionality of the 'no.dups' argument so I can change
> my patch to `data.table:::merge.data.table` accordingly?
> - When `no.dups=TRUE` will the suffix to the by.x column name? Or will it
> take the functionality of the second functionality where only the column in
> y has the suffix added?
> - When `no.dups=FALSE` will the output be the same as it currently (no
> suffix added to either column)? Or will add the suffix to the column in y?
I had started from your patch... and worked from there.
So, there's no need (and use) to provide another one.
I also needed to update the man page, add a regression test, add
an entry to NEWS.Rd ...
Just wait until I commit..
Martin
> Best,
> Scott
> On 22 February 2018 at 22:31, Martin Maechler <maechler at stat.math.ethz.ch>
> wrote:
>> >>>>> Gabriel Becker <gmbecker at ucdavis.edu>
>> >>>>> on Wed, 21 Feb 2018 07:11:44 -0800 writes:
>>
>> > Hi all,
>> > For the record this approach isnt 100% backwards compatible, because
>> > names(mergeddf) will e incompatibly different. Thatx why i claimed
>> > bakcwards compatable-ish
>>
>> exactly.
>>
>> > That said its still worth considering imho because of the reasons
>> stated
>> > (and honestly one particular simple reading of the docs might
>> suggest that
>> > this was thr intended behavior all along). Im not a member of Rcore
>> through
>> > so i cant do said considering myself.
>>
>> I agree with Scott, Frederik and you that this changes seems
>> worth considering.
>> As Duncan Murdoch has mentioned, this alone may not be
>> sufficient.
>>
>> In addition to your proposed patch (which I have simplified, not
>> using intersection() but working with underlying match()
>> directly), it is little work to introduce an extra argument, I'm
>> calling 'no.dups = TRUE' which when set to false would mirror
>> current R's behavior... and documenting it, then also documents the
>> new behavior (to some extent).
>>
>> My plan is to commit it soonish ;-)
>> Martin
>>
>> > Best,
>> > ~G
>>
>> > On Feb 20, 2018 7:15 PM, <frederik at ofb.net> wrote:
>>
>> > Hi Scott,
>>
>> > I tried the new patch and can confirm that it has the advertised
>> > behavior on a couple of test cases. I think it makes sense to apply
>> > it, because any existing code which refers to a second duplicate
>> > data.frame column by name is already broken, while if the reference
>> is
>> > by numerical index then changing the column name shouldn't break it.
>>
>> > I don't know if you need to update the documentation as part of your
>> > patch, or if whoever applies it would be happy to do that. Somebody
>> > from R core want to weigh in on this?
>>
>> > I attach a file with the test example from your original email as
>> well
>> > as a second test case I added with two "by" columns.
>>
>> > Thanks,
>>
>> > Frederick
>>
>> > On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote:
>> >> Hi Frederick,
>> >>
>> >> It looks like I didn't overwrite the patch.diff file after the last
>> edits.
>> >> Here's the correct patch (attached and copied below):
>> >>
>> >> Index: src/library/base/R/merge.R
>> >> ===================================================================
>> >> --- src/library/base/R/merge.R (revision 74280)
>> >> +++ src/library/base/R/merge.R (working copy)
>> >> @@ -157,6 +157,14 @@
>> >> }
>> >>
>> >> if(has.common.nms) names(y) <- nm.y
>> >> + ## If by.x %in% names(y) then duplicate column names still
>> arise,
>> >> + ## apply suffixes to just y - this keeps backwards
>> compatibility
>> >> + ## when referring to by.x in the resulting data.frame
>> >> + dupe.keyx <- intersect(nm.by, names(y))
>> >> + if(length(dupe.keyx)) {
>> >> + if(nzchar(suffixes[2L]))
>> >> + names(y)[match(dupe.keyx, names(y), 0L)] <-
>> paste(dupe.keyx,
>> >> suffixes[2L], sep="")
>> >> + }
>> >> nm <- c(names(x), names(y))
>> >> if(any(d <- duplicated(nm)))
>> >> if(sum(d) > 1L)
>> >>
>> >> Best,
>> >>
>> >> Scott
>> >>
>> >> On 21 February 2018 at 08:23, <frederik at ofb.net> wrote:
>> >>
>> >> > Hi Scott,
>> >> >
>> >> > I think that's a good idea and I tried your patch on my copy of
>> the
>> >> > repository. But it looks to me like the recent patch is identical
>> to
>> >> > the previous one, can you confirm this?
>> >> >
>> >> > Frederick
>> >> >
>> >> > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
>> >> > > Thanks Gabriel,
>> >> > >
>> >> > > I think your suggested approach is 100% backwards compatible
>> >> > >
>> >> > > Currently in the case of duplicate column names only the first
>> can be
>> >> > > indexed by its name. This will always be the column appearing
>> in by.x,
>> >> > > meaning the column in y with the same name cannot be accessed.
>> > Appending
>> >> > > ".y" (suffixes[2L]) to this column means it can now be
>> accessed, while
>> >> > > keeping the current behaviour of making the key columns always
>> > accessible
>> >> > > by using the names provided to by.x.
>> >> > >
>> >> > > I've attached a new patch that has this behaviour.
>> >> > >
>> >> > > Best,
>> >> > >
>> >> > > Scott
>> >> > >
>> >> > > On 19 February 2018 at 05:08, Gabriel Becker <
>> gmbecker at ucdavis.edu>
>> >> > wrote:
>> >> > >
>> >> > > > It seems like there is a way that is backwards compatible-ish
>> in the
>> >> > sense
>> >> > > > mentioned and still has the (arguably, but a good argument I
>> think)
>> >> > better
>> >> > > > behavior:
>> >> > > >
>> >> > > > if by.x is 'name', (AND by.y is not also 'name'), then x's
>> 'name'
>> >> > column
>> >> > > > is called name and y's 'name' column (not used int he merge)
>> is
>> >> > changed to
>> >> > > > name.y.
>> >> > > >
>> >> > > > Now of course this would still change output, but it would
>> change
>> > it to
>> >> > > > something I think would be better, while retaining the 'merge
>> > columns
>> >> > > > retain their exact names' mechanic as documented.
>> >> > > >
>> >> > > > ~G
>> >> > > >
>> >> > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <
>> > s.ritchie73 at gmail.com>
>> >> > > > wrote:
>> >> > > >
>> >> > > >> Thanks Duncan and Frederick,
>> >> > > >>
>> >> > > >> I suspected as much - there doesn't appear to be any reason
>> why
>> >> > conflicts
>> >> > > >> between by.x and names(y) shouldn't and cannot be checked,
>> but I
>> > can
>> >> > see
>> >> > > >> how this might be more trouble than its worth given it
>> potentially
>> > may
>> >> > > >> break downstream packages (i.e. any cases where this occurs
>> but
>> > they
>> >> > > >> expect
>> >> > > >> the name of the key column(s) to remain the same).
>> >> > > >>
>> >> > > >> Best,
>> >> > > >>
>> >> > > >> Scott
>> >> > > >>
>> >> > > >> On 18 February 2018 at 11:48, Duncan Murdoch <
>> >> > murdoch.duncan at gmail.com>
>> >> > > >> wrote:
>> >> > > >>
>> >> > > >> > On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
>> >> > > >> >
>> >> > > >> >> Hi Scott,
>> >> > > >> >>
>> >> > > >> >> Thanks for the patch. I'm not really involved in R
>> development;
>> > it
>> >> > > >> >> will be up to someone in the R core team to apply it. I
>> would
>> >> > hazard
>> >> > > >> >> to say that even if correct (I haven't checked), it will
>> not be
>> >> > > >> >> applied because the change might break existing code. For
>> > example
>> >> > it
>> >> > > >> >> seems like reasonable code might easily assume that a
>> column
>> > with
>> >> > the
>> >> > > >> >> same name as "by.x" exists in the output of 'merge'.
>> That's
>> > just my
>> >> > > >> >> best guess... I don't participate on here often.
>> >> > > >> >>
>> >> > > >> >
>> >> > > >> >
>> >> > > >> > I think you're right. If I were still a member of R Core,
>> I
>> > would
>> >> > want
>> >> > > >> to
>> >> > > >> > test this against all packages on CRAN and Bioconductor,
>> and
>> > since
>> >> > that
>> >> > > >> > test takes a couple of days to run on my laptop, I'd
>> probably
>> > never
>> >> > get
>> >> > > >> > around to it.
>> >> > > >> >
>> >> > > >> > There are lots of cases where "I would have done that
>> > differently",
>> >> > but
>> >> > > >> > most of them are far too much trouble to change now that R
>> is
>> > more
>> >> > than
>> >> > > >> 20
>> >> > > >> > years old. And in many cases it will turn out that the
>> way R
>> > does
>> >> > it
>> >> > > >> > actually does make more sense than the way I would have
>> done it.
>> >> > > >> >
>> >> > > >> > Duncan Murdoch
>> >> > > >> >
>> >> > > >> >
>> >> > > >> >
>> >> > > >> >> Cheers,
>> >> > > >> >>
>> >> > > >> >> Frederick
>> >> > > >> >>
>> >> > > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie
>> wrote:
>> >> > > >> >>
>> >> > > >> >>> The attached patch.diff will make merge.data.frame()
>> append the
>> >> > > >> suffixes
>> >> > > >> >>> to
>> >> > > >> >>> columns with common names between by.x and names(y).
>> >> > > >> >>>
>> >> > > >> >>> Best,
>> >> > > >> >>>
>> >> > > >> >>> Scott Ritchie
>> >> > > >> >>>
>> >> > > >> >>> On 17 February 2018 at 11:15, Scott Ritchie <
>> >> > s.ritchie73 at gmail.com>
>> >> > > >> >>> wrote:
>> >> > > >> >>>
>> >> > > >> >>> Hi Frederick,
>> >> > > >> >>>>
>> >> > > >> >>>> I would expect that any duplicate names in the resulting
>> >> > data.frame
>> >> > > >> >>>> would
>> >> > > >> >>>> have the suffixes appended to them, regardless of
>> whether or
>> > not
>> >> > they
>> >> > > >> >>>> are
>> >> > > >> >>>> used as the join key. So in my example I would expect
>> > "names.x"
>> >> > and
>> >> > > >> >>>> "names.y" to indicate their source data.frame.
>> >> > > >> >>>>
>> >> > > >> >>>> While careful reading of the documentation reveals this
>> is not
>> >> > the
>> >> > > >> >>>> case, I
>> >> > > >> >>>> would argue the intent of the suffixes functionality
>> should
>> >> > equally
>> >> > > >> be
>> >> > > >> >>>> applied to this type of case.
>> >> > > >> >>>>
>> >> > > >> >>>> If you agree this would be useful, I'm happy to write a
>> patch
>> > for
>> >> > > >> >>>> merge.data.frame that will add suffixes in this case - I
>> > intend
>> >> > to do
>> >> > > >> >>>> the
>> >> > > >> >>>> same for merge.data.table in the data.table package
>> where I
>> >> > initially
>> >> > > >> >>>> encountered the edge case.
>> >> > > >> >>>>
>> >> > > >> >>>> Best,
>> >> > > >> >>>>
>> >> > > >> >>>> Scott
>> >> > > >> >>>>
>> >> > > >> >>>> On 17 February 2018 at 03:53, <frederik at ofb.net> wrote:
>> >> > > >> >>>>
>> >> > > >> >>>> Hi Scott,
>> >> > > >> >>>>>
>> >> > > >> >>>>> It seems like reasonable behavior to me. What result
>> would
>> > you
>> >> > > >> expect?
>> >> > > >> >>>>> That the second "name" should be called "name.y"?
>> >> > > >> >>>>>
>> >> > > >> >>>>> The "merge" documentation says:
>> >> > > >> >>>>>
>> >> > > >> >>>>> If the columns in the data frames not used in
>> merging
>> > have
>> >> > any
>> >> > > >> >>>>> common names, these have ‘suffixes’ (‘".x"’ and
>> ‘".y"’
>> > by
>> >> > > >> default)
>> >> > > >> >>>>> appended to try to make the names of the result
>> unique.
>> >> > > >> >>>>>
>> >> > > >> >>>>> Since the first "name" column was used in merging,
>> leaving
>> > both
>> >> > > >> >>>>> without a suffix seems consistent with the
>> documentation...
>> >> > > >> >>>>>
>> >> > > >> >>>>> Frederick
>> >> > > >> >>>>>
>> >> > > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie
>> > wrote:
>> >> > > >> >>>>>
>> >> > > >> >>>>>> Hi,
>> >> > > >> >>>>>>
>> >> > > >> >>>>>> I was unable to find a bug report for this with a
>> cursory
>> >> > search,
>> >> > > >> but
>> >> > > >> >>>>>>
>> >> > > >> >>>>> would
>> >> > > >> >>>>>
>> >> > > >> >>>>>> like clarification if this is intended or unavoidable
>> >> > behaviour:
>> >> > > >> >>>>>>
>> >> > > >> >>>>>> ```{r}
>> >> > > >> >>>>>> # Create example data.frames
>> >> > > >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin",
>> "Lex"),
>> >> > > >> >>>>>> sex=c("F", "M", "F", "M"),
>> >> > > >> >>>>>> age=c(41, 43, 36, 51))
>> >> > > >> >>>>>> children <- data.frame(parent=c("Sarah", "Max",
>> "Qin"),
>> >> > > >> >>>>>> name=c("Oliver", "Sebastian",
>> >> > "Kai-lee"),
>> >> > > >> >>>>>> sex=c("M", "M", "F"),
>> >> > > >> >>>>>> age=c(5,8,7))
>> >> > > >> >>>>>>
>> >> > > >> >>>>>> # Merge() creates a duplicated "name" column:
>> >> > > >> >>>>>> merge(parents, children, by.x = "name", by.y =
>> "parent")
>> >> > > >> >>>>>> ```
>> >> > > >> >>>>>>
>> >> > > >> >>>>>> Output:
>> >> > > >> >>>>>> ```
>> >> > > >> >>>>>> name sex.x age.x name sex.y age.y
>> >> > > >> >>>>>> 1 Max M 43 Sebastian M 8
>> >> > > >> >>>>>> 2 Qin F 36 Kai-lee F 7
>> >> > > >> >>>>>> 3 Sarah F 41 Oliver M 5
>> >> > > >> >>>>>> Warning message:
>> >> > > >> >>>>>> In merge.data.frame(parents, children, by.x = "name",
>> by.y =
>> >> > > >> >>>>>> "parent") :
>> >> > > >> >>>>>> column name ‘name’ is duplicated in the result
>> >> > > >> >>>>>> ```
>> >> > > >> >>>>>>
>> >> > > >> >>>>>> Kind Regards,
>> >> > > >> >>>>>>
>> >> > > >> >>>>>> Scott Ritchie
>> >> > > >> >>>>>>
>> >> > > >> >>>>>> [[alternative HTML version deleted]]
>> >> > > >> >>>>>>
>> >> > > >> >>>>>> ______________________________________________
>> >> > > >> >>>>>> R-devel at r-project.org mailing list
>> >> > > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> >> > > >> >>>>>>
>> >> > > >> >>>>>>
>> >> > > >> >>>>>
>> >> > > >> >>>>
>> >> > > >> >>>>
>> >> > > >> >> Index: src/library/base/R/merge.R
>> >> > > >> >>> ==============================
>> ==============================
>> >> > =======
>> >> > > >> >>> --- src/library/base/R/merge.R (revision 74264)
>> >> > > >> >>> +++ src/library/base/R/merge.R (working copy)
>> >> > > >> >>> @@ -157,6 +157,15 @@
>> >> > > >> >>> }
>> >> > > >> >>> if(has.common.nms) names(y) <- nm.y
>> >> > > >> >>> + ## If by.x %in% names(y) then duplicate column
>> names
>> >> > still
>> >> > > >> >>> arise,
>> >> > > >> >>> + ## apply suffixes to these
>> >> > > >> >>> + dupe.keyx <- intersect(nm.by, names(y))
>> >> > > >> >>> + if(length(dupe.keyx)) {
>> >> > > >> >>> + if(nzchar(suffixes[1L]))
>> >> > > >> >>> + names(x)[match(dupe.keyx, names(x), 0L)] <-
>> >> > > >> >>> paste(dupe.keyx, suffixes[1L], sep="")
>> >> > > >> >>> + if(nzchar(suffixes[2L]))
>> >> > > >> >>> + names(y)[match(dupe.keyx, names(y), 0L)] <-
>> >> > > >> >>> paste(dupe.keyx, suffixes[2L], sep="")
>> >> > > >> >>> + }
>> >> > > >> >>> nm <- c(names(x), names(y))
>> >> > > >> >>> if(any(d <- duplicated(nm)))
>> >> > > >> >>> if(sum(d) > 1L)
>> >> > > >> >>>
>> >> > > >> >>
>> >> > > >> >> ______________________________________________
>> >> > > >> >> R-devel at r-project.org mailing list
>> >> > > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>> >> > > >> >>
>> >> > > >> >>
>> >> > > >> >
>> >> > > >>
>> >> > > >> [[alternative HTML version deleted]]
>> >> > > >>
>> >> > > >> ______________________________________________
>> >> > > >> R-devel at r-project.org mailing list
>> >> > > >> https://stat.ethz.ch/mailman/listinfo/r-devel
>> >> > > >>
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > > > --
>> >> > > > Gabriel Becker, PhD
>> >> > > > Scientist (Bioinformatics)
>> >> > > > Genentech Research
>> >> > > >
>> >> >
>> >> > > Index: src/library/base/R/merge.R
>> >> > > ============================================================
>> =======
>> >> > > --- src/library/base/R/merge.R (revision 74264)
>> >> > > +++ src/library/base/R/merge.R (working copy)
>> >> > > @@ -157,6 +157,15 @@
>> >> > > }
>> >> > >
>> >> > > if(has.common.nms) names(y) <- nm.y
>> >> > > + ## If by.x %in% names(y) then duplicate column names
>> still
>> >> > arise,
>> >> > > + ## apply suffixes to these
>> >> > > + dupe.keyx <- intersect(nm.by, names(y))
>> >> > > + if(length(dupe.keyx)) {
>> >> > > + if(nzchar(suffixes[1L]))
>> >> > > + names(x)[match(dupe.keyx, names(x), 0L)] <-
>> >> > paste(dupe.keyx, suffixes[1L], sep="")
>> >> > > + if(nzchar(suffixes[2L]))
>> >> > > + names(y)[match(dupe.keyx, names(y), 0L)] <-
>> >> > paste(dupe.keyx, suffixes[2L], sep="")
>> >> > > + }
>> >> > > nm <- c(names(x), names(y))
>> >> > > if(any(d <- duplicated(nm)))
>> >> > > if(sum(d) > 1L)
>> >> >
>> >> > > ______________________________________________
>> >> > > R-devel at r-project.org mailing list
>> >> > > https://stat.ethz.ch/mailman/listinfo/r-devel
>> >> >
>> >> >
>>
>> >> Index: src/library/base/R/merge.R
>> >> ===================================================================
>> >> --- src/library/base/R/merge.R (revision 74280)
>> >> +++ src/library/base/R/merge.R (working copy)
>> >> @@ -157,6 +157,14 @@
>> >> }
>> >>
>> >> if(has.common.nms) names(y) <- nm.y
>> >> + ## If by.x %in% names(y) then duplicate column names still
>> arise,
>> >> + ## apply suffixes to just y - this keeps backwards
>> compatibility
>> >> + ## when referring to by.x in the resulting data.frame
>> >> + dupe.keyx <- intersect(nm.by, names(y))
>> >> + if(length(dupe.keyx)) {
>> >> + if(nzchar(suffixes[2L]))
>> >> + names(y)[match(dupe.keyx, names(y), 0L)] <-
>> paste(dupe.keyx,
>> > suffixes[2L], sep="")
>> >> + }
>> >> nm <- c(names(x), names(y))
>> >> if(any(d <- duplicated(nm)))
>> >> if(sum(d) > 1L)
>>
>> > [[alternative HTML version deleted]]
>>
>> > ______________________________________________
>> > R-devel at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel
>>
> [[alternative HTML version deleted]]
More information about the R-devel
mailing list