[Rd] duplicated factor labels.

peter dalgaard pdalgd at gmail.com
Fri Jun 23 11:51:05 CEST 2017


Hmm, the danger in this is that duplicated factor levels _used_ to be allowed (i.e. multiple codes with the same level). Disallowing it is what broke read.spss() on some files, because SPSS's concept of value labels is not 1-to-1 with factors. 

Reallowing it with different semantics could be premature. I mean, if we hadn't had the "forbidden" step, read.spss() could have changed behaviour unnoticed. So what if there is code relying on duplicate factor levels, which hasn't been run for some time?
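
For anyone wanting to check whether old code is affected, here is a minimal sketch, not a definitive test. It assumes the input vector from Paul's example quoted below; the names xf and xf1 are only illustrative. The two-step form merges levels via `levels<-` and should behave the same before and after the change. The one-step form is wrapped in tryCatch() because it is an error in versions that reject duplicated labels.

```r
# Illustrative input, borrowed from the example quoted further down
x <- c("Male", "Man", "male", "Man", "Female")

# Two-step form: build the factor, then merge levels by assigning
# duplicated labels through `levels<-`
xf <- factor(x, levels = c("Male", "Man", "male", "Female"))
levels(xf) <- c("Male", "Male", "Male", "Female")

# One-step form: returns NULL here if this R version rejects
# duplicated labels in factor()
xf1 <- tryCatch(
    factor(x, levels = c("Male", "Man", "male", "Female"),
           labels = c("Male", "Male", "Male", "Female")),
    error = function(e) NULL)

# Where the one-step form is accepted, it should agree with the two-step form
if (!is.null(xf1)) {
    stopifnot(identical(levels(xf), levels(xf1)),
              identical(as.character(xf), as.character(xf1)))
}

# Defensive check: a merged factor should carry no duplicated levels
stopifnot(anyDuplicated(levels(xf)) == 0L)
```

A check like the last line is also a cheap way to flag a factor that was built under the old semantics, where several codes could share one level.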

-pd

> On 23 Jun 2017, at 10:42 , Martin Maechler <maechler at stat.math.ethz.ch> wrote:
> 
>>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>>    on Thu, 22 Jun 2017 11:43:59 +0200 writes:
> 
>>>>>> Paul Johnson <pauljohn32 at gmail.com>
>>>>>>    on Fri, 16 Jun 2017 11:02:34 -0500 writes:
> 
>>> On Fri, Jun 16, 2017 at 2:35 AM, Joris Meys <jorismeys at gmail.com> wrote:
>>>> To extend on Martin's explanation:
>>>> 
>>>> In factor(), levels are the unique input values and labels the unique output
>>>> values. So the function levels() actually displays the labels.
>>>> 
> 
>>> Dear Joris
> 
>>> I think we agree. Currently, factor insists both levels and labels be unique.
> 
>>> I wish that it would accept nonunique labels. I also understand it
>>> is impractical to change this now in base R.
> 
>>> I don't think I succeeded in explaining why this would be nicer.
>>> Here's another example. Fairly often, we see input data like
> 
>>> x <- c("Male", "Man", "male", "Man", "Female")
> 
>>> The first four represent the same value.  I'd like to go in one step
>>> to a new factor variable with enumerated types "Male" and "Female".
>>> This fails
> 
>>> xf <- factor(x, levels = c("Male", "Man", "male", "Female"),
>>> labels = c("Male", "Male", "Male", "Female"))
> 
>>> Instead, we need 2 steps.
> 
>>> xf <- factor(x, levels = c("Male", "Man", "male", "Female"))
>>> levels(xf) <- c("Male", "Male", "Male", "Female")
> 
>>> I think it is quirky that `levels<-.factor` allows the duplicated
>>> labels, whereas factor does not.
> 
>>> I wrote a function rockchalk::combineLevels to simplify combining
>>> levels, but most of the students here like plyr::mapvalues to do it.
>>> The use of levels() can be tricky because one must enumerate all
>>> values, not just the ones being changed.
> 
>>> But I do understand Martin's point. It's been this way 25 years, it
>>> won't change. :).
> 
>> Well.. the above is a bit out of context.
> 
>> Your first example really did not make a point to me (and Joris)
>> and I showed that you could use even two different simple factor() calls to
>> produce what you wanted 
>> yc <- factor(c("1",NA,NA,"4","4","4"))
>> yn <- factor(c( 1, NA,NA, 4,  4,  4))
> 
>> Your new example is indeed  much more convincing !
> 
>> (Note though that the two steps that are needed can be written 
>> more shortly
> 
>> The  "been this way 25 years"  is a reason to be very
>> cautious(*) with changes, but not a reason for no changes!
> 
>> (*) Indeed as some of you have noted we really should not "break behavior".
>> This means to me we cannot accept a change there which gives
>> an error or a different result in cases the old behavior gave a valid factor.
> 
>> I'm looking at a possible change currently
>> [not promising that a change will happen ...]
> 
> In the end, I've liked the change (after 2-3 iterations), and
> have now been brave enough to commit to R-devel (svn 72845).
> 
> With the change, I had to disable one of our own regression
> checks (tests/reg-tests-1b.R, line 726):
> 
> The following is now (in R-devel -> R 3.5.0) valid:
> 
>> factor(1:2, labels = c("A","A"))
>   [1] A A
>   Levels: A
>> 
> 
> I wonder how many CRAN package checks will "break" from
> this (my guess is in the order of a dozen), but I hope
> that these breakages will be benign, e.g., similar to the above
> case where before an error was expected via tools :: assertError(.)
> 
> Martin
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


