[Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y

Wed Feb 21 00:06:21 CET 2018

Hi Frederick,

It looks like I didn't overwrite the patch.diff file after the last edits.
Here's the correct patch (attached and copied below):

Index: src/library/base/R/merge.R
===================================================================

--- src/library/base/R/merge.R (revision 74280)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,14 @@
         }

         if(has.common.nms) names(y) <- nm.y
+        ## If by.x %in% names(y) then duplicate column names still arise,
+        ## apply suffixes to just y - this keeps backwards compatibility
+        ## when referring to by.x in the resulting data.frame
+        dupe.keyx <- intersect(nm.by, names(y))
+        if(length(dupe.keyx)) {
+          if(nzchar(suffixes[2L]))
+            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx,
suffixes[2L], sep="")
+        }
         nm <- c(names(x), names(y))
         if(any(d <- duplicated(nm)))
             if(sum(d) > 1L)

Best,

Scott

On 21 February 2018 at 08:23, <frederik at ofb.net> wrote:

> Hi Scott,
>
> I think that's a good idea and I tried your patch on my copy of the
> repository. But it looks to me like the recent patch is identical to
> the previous one, can you confirm this?
>
> Frederick
>
> On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
> > Thanks Gabriel,
> >
> > I think your suggested approach is 100% backwards compatible
> >
> > Currently in the case of duplicate column names only the first can be
> > indexed by its name. This will always be the column appearing in by.x,
> > meaning the column in y with the same name cannot be accessed. Appending
> > ".y" (suffixes[2L]) to this column means it can now be accessed, while
> > keeping the current behaviour of making the key columns always accessible
> > by using the names provided to by.x.
> >
> > I've attached a new patch that has this behaviour.
> >
> > Best,
> >
> > Scott
> >
> > On 19 February 2018 at 05:08, Gabriel Becker <gmbecker at ucdavis.edu>
> wrote:
> >
> > > It seems like there is a way that is backwards compatible-ish in the
> sense
> > > mentioned and still has the (arguably, but a good argument I think)
> better
> > > behavior:
> > >
> > > if by.x is 'name', (AND by.y is not also 'name'), then x's 'name'
> column
> > > is called name and y's 'name' column (not used int he merge) is
> changed to
> > > name.y.
> > >
> > > Now of course this would still change output, but it would change it to
> > > something I think would be better, while retaining the 'merge columns
> > > retain their exact names' mechanic as documented.
> > >
> > > ~G
> > >
> > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <s.ritchie73 at gmail.com>
> > > wrote:
> > >
> > >> Thanks Duncan and Frederick,
> > >>
> > >> I suspected as much - there doesn't appear to be any reason why
> conflicts
> > >> between by.x and names(y) shouldn't and cannot be checked, but I can
> see
> > >> how this might be more trouble than its worth given it potentially may
> > >> break downstream packages (i.e. any cases where this occurs but they
> > >> expect
> > >> the name of the key column(s) to remain the same).
> > >>
> > >> Best,
> > >>
> > >> Scott
> > >>
> > >> On 18 February 2018 at 11:48, Duncan Murdoch <
> murdoch.duncan at gmail.com>
> > >> wrote:
> > >>
> > >> > On 17/02/2018 6:36 PM, frederik at ofb.net wrote:
> > >> >
> > >> >> Hi Scott,
> > >> >>
> > >> >> Thanks for the patch. I'm not really involved in R development; it
> > >> >> will be up to someone in the R core team to apply it. I would
> hazard
> > >> >> to say that even if correct (I haven't checked), it will not be
> > >> >> applied because the change might break existing code. For example
> it
> > >> >> seems like reasonable code might easily assume that a column with
> the
> > >> >> same name as "by.x" exists in the output of 'merge'. That's just my
> > >> >> best guess... I don't participate on here often.
> > >> >>
> > >> >
> > >> >
> > >> > I think you're right.  If I were still a member of R Core, I would
> want
> > >> to
> > >> > test this against all packages on CRAN and Bioconductor, and since
> that
> > >> > test takes a couple of days to run on my laptop, I'd probably never
> get
> > >> > around to it.
> > >> >
> > >> > There are lots of cases where "I would have done that differently",
> but
> > >> > most of them are far too much trouble to change now that R is more
> than
> > >> 20
> > >> > years old.  And in many cases it will turn out that the way R does
> it
> > >> > actually does make more sense than the way I would have done it.
> > >> >
> > >> > Duncan Murdoch
> > >> >
> > >> >
> > >> >
> > >> >> Cheers,
> > >> >>
> > >> >> Frederick
> > >> >>
> > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
> > >> >>
> > >> >>> The attached patch.diff will make merge.data.frame() append the
> > >> suffixes
> > >> >>> to
> > >> >>> columns with common names between by.x and names(y).
> > >> >>>
> > >> >>> Best,
> > >> >>>
> > >> >>> Scott Ritchie
> > >> >>>
> > >> >>> On 17 February 2018 at 11:15, Scott Ritchie <
> s.ritchie73 at gmail.com>
> > >> >>> wrote:
> > >> >>>
> > >> >>> Hi Frederick,
> > >> >>>>
> > >> >>>> I would expect that any duplicate names in the resulting
> data.frame
> > >> >>>> would
> > >> >>>> have the suffixes appended to them, regardless of whether or not
> they
> > >> >>>> are
> > >> >>>> used as the join key. So in my example I would expect "names.x"
> and
> > >> >>>> "names.y" to indicate their source data.frame.
> > >> >>>>
> > >> >>>> While careful reading of the documentation reveals this is not
> the
> > >> >>>> case, I
> > >> >>>> would argue the intent of the suffixes functionality should
> equally
> > >> be
> > >> >>>> applied to this type of case.
> > >> >>>>
> > >> >>>> If you agree this would be useful, I'm happy to write a patch for
> > >> >>>> merge.data.frame that will add suffixes in this case - I intend
> to do
> > >> >>>> the
> > >> >>>> same for merge.data.table in the data.table package where I
> initially
> > >> >>>> encountered the edge case.
> > >> >>>>
> > >> >>>> Best,
> > >> >>>>
> > >> >>>> Scott
> > >> >>>>
> > >> >>>> On 17 February 2018 at 03:53, <frederik at ofb.net> wrote:
> > >> >>>>
> > >> >>>> Hi Scott,
> > >> >>>>>
> > >> >>>>> It seems like reasonable behavior to me. What result would you
> > >> expect?
> > >> >>>>> That the second "name" should be called "name.y"?
> > >> >>>>>
> > >> >>>>> The "merge" documentation says:
> > >> >>>>>
> > >> >>>>>      If the columns in the data frames not used in merging have
> any
> > >> >>>>>      common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by
> > >> default)
> > >> >>>>>      appended to try to make the names of the result unique.
> > >> >>>>>
> > >> >>>>> Since the first "name" column was used in merging, leaving both
> > >> >>>>> without a suffix seems consistent with the documentation...
> > >> >>>>>
> > >> >>>>> Frederick
> > >> >>>>>
> > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
> > >> >>>>>
> > >> >>>>>> Hi,
> > >> >>>>>>
> > >> >>>>>> I was unable to find a bug report for this with a cursory
> search,
> > >> but
> > >> >>>>>>
> > >> >>>>> would
> > >> >>>>>
> > >> >>>>>> like clarification if this is intended or unavoidable
> behaviour:
> > >> >>>>>>
> > >> >>>>>> ```{r}
> > >> >>>>>> # Create example data.frames
> > >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
> > >> >>>>>>                        sex=c("F", "M", "F", "M"),
> > >> >>>>>>                        age=c(41, 43, 36, 51))
> > >> >>>>>> children <- data.frame(parent=c("Sarah", "Max", "Qin"),
> > >> >>>>>>                         name=c("Oliver", "Sebastian",
> "Kai-lee"),
> > >> >>>>>>                         sex=c("M", "M", "F"),
> > >> >>>>>>                         age=c(5,8,7))
> > >> >>>>>>
> > >> >>>>>> # Merge() creates a duplicated "name" column:
> > >> >>>>>> merge(parents, children, by.x = "name", by.y = "parent")
> > >> >>>>>> ```
> > >> >>>>>>
> > >> >>>>>> Output:
> > >> >>>>>> ```
> > >> >>>>>>     name sex.x age.x      name sex.y age.y
> > >> >>>>>> 1   Max     M    43 Sebastian     M     8
> > >> >>>>>> 2   Qin     F    36   Kai-lee     F     7
> > >> >>>>>> 3 Sarah     F    41    Oliver     M     5
> > >> >>>>>> Warning message:
> > >> >>>>>> In merge.data.frame(parents, children, by.x = "name", by.y =
> > >> >>>>>> "parent") :
> > >> >>>>>>    column name ‘name’ is duplicated in the result
> > >> >>>>>> ```
> > >> >>>>>>
> > >> >>>>>> Kind Regards,
> > >> >>>>>>
> > >> >>>>>> Scott Ritchie
> > >> >>>>>>
> > >> >>>>>>        [[alternative HTML version deleted]]
> > >> >>>>>>
> > >> >>>>>> ______________________________________________
> > >> >>>>>> R-devel at r-project.org mailing list
> > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>
> > >> >>>>
> > >> >>>>
> > >> >> Index: src/library/base/R/merge.R
> > >> >>> ============================================================
> =======
> > >> >>> --- src/library/base/R/merge.R  (revision 74264)
> > >> >>> +++ src/library/base/R/merge.R  (working copy)
> > >> >>> @@ -157,6 +157,15 @@
> > >> >>>           }
> > >> >>>             if(has.common.nms) names(y) <- nm.y
> > >> >>> +        ## If by.x %in% names(y) then duplicate column names
> still
> > >> >>> arise,
> > >> >>> +        ## apply suffixes to these
> > >> >>> +        dupe.keyx <- intersect(nm.by, names(y))
> > >> >>> +        if(length(dupe.keyx)) {
> > >> >>> +          if(nzchar(suffixes[1L]))
> > >> >>> +            names(x)[match(dupe.keyx, names(x), 0L)] <-
> > >> >>> paste(dupe.keyx, suffixes[1L], sep="")
> > >> >>> +          if(nzchar(suffixes[2L]))
> > >> >>> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> > >> >>> paste(dupe.keyx, suffixes[2L], sep="")
> > >> >>> +        }
> > >> >>>           nm <- c(names(x), names(y))
> > >> >>>           if(any(d <- duplicated(nm)))
> > >> >>>               if(sum(d) > 1L)
> > >> >>>
> > >> >>
> > >> >> ______________________________________________
> > >> >> R-devel at r-project.org mailing list
> > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > >> >>
> > >> >>
> > >> >
> > >>
> > >>         [[alternative HTML version deleted]]
> > >>
> > >> ______________________________________________
> > >> R-devel at r-project.org mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > >>
> > >
> > >
> > >
> > > --
> > > Gabriel Becker, PhD
> > > Scientist (Bioinformatics)
> > > Genentech Research
> > >
>
> > Index: src/library/base/R/merge.R
> > ===================================================================
> > --- src/library/base/R/merge.R        (revision 74264)
> > +++ src/library/base/R/merge.R        (working copy)
> > @@ -157,6 +157,15 @@
> >          }
> >
> >          if(has.common.nms) names(y) <- nm.y
> > +        ## If by.x %in% names(y) then duplicate column names still
> arise,
> > +        ## apply suffixes to these
> > +        dupe.keyx <- intersect(nm.by, names(y))
> > +        if(length(dupe.keyx)) {
> > +          if(nzchar(suffixes[1L]))
> > +            names(x)[match(dupe.keyx, names(x), 0L)] <-
> paste(dupe.keyx, suffixes[1L], sep="")
> > +          if(nzchar(suffixes[2L]))
> > +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> paste(dupe.keyx, suffixes[2L], sep="")
> > +        }
> >          nm <- c(names(x), names(y))
> >          if(any(d <- duplicated(nm)))
> >              if(sum(d) > 1L)
>
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch.diff
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20180221/7e77cbd6/attachment.ksh>