[Rd] factors with non-unique ("duplicated") levels have been deprecated since 2009 -- are *more* deprecated now -- and why you should be hesitant misusing suppressWarnings()

Martin Maechler maechler at stat.math.ethz.ch
Sat Jun 4 19:32:04 CEST 2016


From this bug report (it's a proposal for speedup only, not a bug),
   https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16895#c6
the fact that you can construct factors with non-unique aka
"duplicated" levels in R  has been re-raised.  As mentioned there,
we had a small discussion here (on 'R-devel') a bit more than 7 years
ago, where I had said that R core had indeed decided
that factors with duplicated levels would be deprecated from R version
2.10.0 on ... indeed a while ago.

As factors are not S4 objects, there is no really formal class
definition and no inherent class validation, but even so, in 2009 we
changed `levels<-` such that it raises a warning when the levels are
not unique:

> aba <- c("a","b","a"); x <- factor(aba, levels=aba)
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else
paste0(labels,  :
  duplicated levels in factors are deprecated
>

We've finally decided to make this an error in R-devel  (which is
planned for release, probably as R 3.4.0, in April 2017):

> aba <- c("a","b","a"); x <- factor(aba, levels=aba)
Error in `levels<-`(`*tmp*`, value = if (nl == nL)
as.character(labels) else paste0(labels,  :
  factor level [3] is duplicated
>
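
When the candidate levels may contain repeats, one way to stay clear of
the error is to pass them through unique() first.  A minimal sketch,
reusing 'aba' from the example above (this is only one possible fix; it
assumes that collapsing the repeated value into a single level is what
is wanted):

```r
aba <- c("a", "b", "a")

# unique() drops the repeated "a", so the levels passed to factor()
# are guaranteed to be distinct
x <- factor(aba, levels = unique(aba))

levels(x)      # "a" "b"
as.integer(x)  # 1 2 1
```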

If you know R well, you'll know that it is still very easy to
construct factors in R with invalid levels.
For this reason, also *printing* such factors now produces a warning:

> f
[1] 1 2 2 3 3 2 2 1
Levels: 1 2 2 3
Warning message:
In print.factor(x) : duplicated level [3] in factor
>
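
The construction of 'f' is not shown above.  As a sketch of how such an
object can arise, the attributes can be set directly, which bypasses
`levels<-` and its check.  The integer codes below are an assumption,
chosen only so that the result has the same levels as the printed
example; they are not necessarily how 'f' was actually built:

```r
# Setting attributes via structure() does not go through `levels<-`,
# so no uniqueness check on the levels is performed here.
f <- structure(c(1L, 2L, 3L, 4L, 4L, 3L, 2L, 1L),
               levels = c("1", "2", "2", "3"),
               class = "factor")

is.factor(f)              # TRUE, although the object is not a valid factor
anyDuplicated(levels(f))  # 3, the index of the first duplicated level
```

anyDuplicated() returns 0 for valid levels, so it can serve as a cheap
test before such an object is handed on to other code.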
----------------------------------------------------------------------------------------

We have found at least two packages that are affected by this change
in that they no longer pass 'R CMD check' on R-devel:
1) plyr --- but there it is just a check which previously checked for
the *warning* mentioned above, which is now an error.  So only the
check must be amended (quite easily).
2) MicroDatosEs: now fails in example(censo2010),
and that is the reason for this posting:  I would claim that it is
not primarily the fault of the 'MicroDatosEs' maintainer, but actually of
a package that it depends on, 'memisc'.
That package has a "nice" S4 method for producing a factor from an
"item.vector"  (though I would find an as(..) method [defined via
setAs(..)] much more natural than an 'as.factor()' method):

> selectMethod("as.factor", "item.vector")
Method Definition:

function (x)
{
    labels <- x@value.labels
    if (length(labels)) {
        values <- labels@values
        labels <- labels@.Data
    }
    else {
        values <- labels <- sort(unique(x@.Data))
    }
    filter <- x@value.filter
    use.levels <- if (length(filter))
        is.valid2(values, filter)
    else TRUE
    f <- suppressWarnings(factor(x@.Data, levels = values[use.levels],
        labels = labels[use.levels]))
    if (length(attr(x, "contrasts")))
        contrasts(f) <- contrasts(x)
    f
}
<environment: namespace:memisc>


and the suppressWarnings(..) has "ensured" all these years since
2009 that users and package writers were never alerted to the
programming "glitch" (of not ensuring that levels/labels were correct).
They should have seen that factor() was sometimes called in situations
where it produced an invalid factor, namely one where some levels were
duplicated, and so the memisc authors could have
ensured that the above method would produce correct factors.
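
As an illustration only, here is a sketch of validating the inputs
explicitly instead of wrapping factor() in suppressWarnings().  The
helper name 'checked_factor' is hypothetical (it is neither part of
memisc nor of base R), and treating duplicated labels as an error is an
assumption of this sketch rather than a requirement:

```r
# Hypothetical helper: validate levels and labels first, then call
# factor() without suppressing any warnings.
checked_factor <- function(x, levels, labels = levels) {
    if (length(labels) != length(levels))
        stop("'labels' must have the same length as 'levels'")
    # anyDuplicated() returns 0 if all entries are distinct, otherwise
    # the index of the first duplicated entry
    if (d <- anyDuplicated(levels))
        stop("level [", d, "] is duplicated: ", levels[d])
    if (d <- anyDuplicated(labels))
        stop("label [", d, "] is duplicated: ", labels[d])
    factor(x, levels = levels, labels = as.character(labels))
}

# Valid input is passed through to factor() unchanged
f <- checked_factor(c(2, 1, 2), levels = c(1, 2), labels = c("low", "high"))

# Invalid input fails with an explicit message instead of being hidden
aba <- c("a", "b", "a")
res <- tryCatch(checked_factor(aba, levels = aba), error = conditionMessage)
```

With a check of this kind the problem surfaces where the levels are
computed, rather than much later when the factor is used.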

Martin Maechler,
R core team / ETH Zurich


