[Rd] Bug report - duplicate row names with as.data.frame()

Thu Mar 8 18:09:52 CET 2018

>>>>> Martyn Plummer <plummerm at iarc.fr>
>>>>>     on Thu, 1 Mar 2018 17:23:04 +0000 writes:

    > On Thu, 2018-03-01 at 09:36 -0500, Ron wrote:
    >> Hello,
    >> 
    >> I'd like to report what I think is a bug: using as.data.frame() we can
    >> create duplicate row names in a data frame. R version 3.4.3 (current stable
    >> release).
    >> 
    >> Rather than paste code in an email, please see the example formatted code
    >> here:
    >> https://stackoverflow.com/questions/49031523/duplicate-row-names-in-r-using-as-data-frame
    >> 
    >> I posted to StackOverflow, and consensus was that we should proceed with
    >> this as a bug report.

    > Yes that is definitely a bug. 

    > The end of the as.data.frame.matrix method has:

    > attr(value, "row.names") <- row.names
    > class(value) <- "data.frame"
    > value

    > Changing this to:

    > class(value) <- "data.frame"
    > row.names(value) <- row.names
    > value

    > ensures that the row.names<-.data.frame method is called with its
    > built-in check for duplicate names.

    > There are quite a few as.data.frame methods so this could be a
    > recurring problem. I will check.

and Martyn found other cases and proposed a more principled
approach to conceptually all such situations.
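
As a minimal sketch of the difference Martyn describes above (the object
names df_attr, df_method and dup are made up for illustration): setting the
"row.names" attribute directly stores duplicates silently, while going
through the replacement function reaches the data frame method and its check.

```r
# Toy data frames; all names here are illustrative only.
df_attr   <- data.frame(x = 1:3)
df_method <- data.frame(x = 1:3)
dup <- c("a", "a", "b")

# Setting the attribute directly bypasses the data frame method,
# so the duplicated names are stored without complaint.
attr(df_attr, "row.names") <- dup
anyDuplicated(attr(df_attr, "row.names")) > 0   # TRUE

# The replacement function dispatches to the data frame method,
# which refuses duplicated names by signalling an error.
res <- tryCatch(
  suppressWarnings({ row.names(df_method) <- dup; "accepted" }),
  error = function(e) e
)
inherits(res, "error")   # TRUE
```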

From that, I have addressed at least the current bug
(and its immediate surroundings).  I now have committed the
following to 'R-devel' (= the R sources development "trunk") :

------------------------------------------------------------------------
r74373 | maechler | 2018-03-08 17:49:32 +0100 (Thu, 08. Mar 2018)

   M doc/NEWS.Rd
   M src/library/base/R/dataframe.R
   M src/library/base/man/as.data.frame.Rd
   M src/library/base/man/row.names.Rd
   M tests/eval-etc.Rout.save
   M tests/reg-tests-1c.R
   M tests/reg-tests-1d.R
   M tests/reg-tests-2.Rout.save

duplicated rownames in as.data.frame.matrix() are handled now (gracefully by default)
------------------------------------------------------------------------

The NEWS entry is

    • Some as.data.frame() methods, notably the matrix one, are now
      more careful in not accepting duplicated or NA row names, and by
      default produce unique non-NA row names.  This is based on
      row.names(x, make.names = *) <- rNms where make.names is a new
      logical, with back compatible default.

and the not-quite-back-compatible API change is that the
`row.names<-` S3 generic function now has a new optional
'make.names' argument -- with back-compatible default FALSE
(meaning that invalid row names continue to lead to an error by default).

It may happen that this or the other changes have some negative
impact on the CRAN package check results (I do expect *some*
check problems), e.g. producing new warnings if packages use the
current R <= 3.4.x signature of `row.names<-`.
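
For package authors who want to see which signature their installed R
provides, a small sketch (gen_args is just an illustrative variable name;
what it reports depends on the R version in use):

```r
# Look up the generic in base and list its formal argument names.
gen_args <- names(formals(base::`row.names<-`))
gen_args

# TRUE only where the generic itself carries the argument described here.
"make.names" %in% gen_args
```

A method written against the older two-argument form would need the same
formals as the generic to keep the checks quiet.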

But I think the new feature of letting the caller indicate how to treat
invalid row names --- notably, allowing the use of  make.names(*, unique=TRUE)
to get valid row names --- is attractive and leads to Martyn's
proposed behavior, which entails that  as.data.frame.*(x)  (and
similar coercions to data frames) should typically _handle_
invalid row names rather than signal errors.
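
To illustrate the "handle rather than signal" idea, here is a hedged sketch
using only make.names() and the ordinary row.names<- call. The helper
set_row_names() and its 'repair' argument are hypothetical, written for this
example; they are not the committed implementation.

```r
# Hypothetical helper (not part of base R): either reject invalid row names
# or repair them with make.names(*, unique = TRUE).
set_row_names <- function(x, value, repair = FALSE) {
  value <- as.character(value)
  if (anyDuplicated(value) > 0L || anyNA(value)) {
    if (!repair) stop("invalid row names")
    value <- make.names(value, unique = TRUE)
  }
  row.names(x) <- value
  x
}

df  <- data.frame(u = 1:3, v = 4:6)
bad <- c("a", "a", "b")

make.names(bad, unique = TRUE)                  # "a" "a.1" "b"

# Repairing: the duplicates are made unique instead of raising an error.
fixed <- set_row_names(df, bad, repair = TRUE)
row.names(fixed)                                # "a" "a.1" "b"

# Strict default: the same input is rejected.
strict <- tryCatch(set_row_names(df, bad), error = function(e) e)
inherits(strict, "error")                       # TRUE
```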

Feedback is welcome !

((though I will be slow in replying, going basically off work for
  my early-starting weekend in the Alps))

Martin Maechler,
ETH Zurich