[Rd] vctrs: a type system for the tidyverse
Martin Maechler
m@echler @ending from @t@t@m@th@ethz@ch
Wed Aug 8 17:54:16 CEST 2018
>>>>> Hadley Wickham
>>>>> on Wed, 8 Aug 2018 09:34:42 -0500 writes:
>>>> Method dispatch for `vec_c()` is quite simple because
>>>> associativity and commutativity mean that we can
>>>> determine the output type only by considering a pair of
>>>> inputs at a time. To this end, vctrs provides
>>>> `vec_type2()` which takes two inputs and returns their
>>>> common type (represented as a zero-length vector):
>>>>
>>>> str(vec_type2(integer(), double()))
>>>> #> num(0)
>>>>
>>>> str(vec_type2(factor("a"), factor("b")))
>>>> #> Factor w/ 2 levels "a","b":
>>>
>>>
>>> What is the reasoning behind taking the union of the
>>> levels here? I'm not sure that is actually the behavior
>>> I would want if I have a vector of factors and I try to
>>> append some new data to it. I might want/expect to
>>> retain the existing levels and get either NAs or an
>>> error if the new data has (present) levels not in the
>>> first data. The behavior as above doesn't seem in line
>>> with what I understand the purpose of factors to be
>>> (explicit restriction of possible values).
>>
>> Originally (like a week ago 😀), we threw an error if the
>> factors didn't have the same levels, and provided an
>> optional coercion to character. I decided that while this is
>> correct (the factor levels are a parameter of the type,
>> and hence factors with different levels aren't
>> comparable), it fights too much against how people
>> actually use factors in practice. It also seems like base
>> R is moving more in this direction, i.e. in R 3.4
>> factor("a") == factor("b") is an error, whereas in R 3.5
>> it returns FALSE.
> I now have a better argument, I think:
> If you squint your brain a little, I think you can see
> that each set of automatic coercions is about increasing
> resolution. Integers are low resolution versions of
> doubles, and dates are low resolution versions of
> date-times. Logicals are low resolution versions of
> integers because there's a strong convention that `TRUE`
> and `FALSE` can be used interchangeably with `1` and `0`.
> But what is the resolution of a factor? We must take a
> somewhat pragmatic approach because base R often converts
> character vectors to factors, and we don't want to be
> burdensome to users. So we say that a factor `x` has finer
> resolution than factor `y` if the levels of `y` are
> contained in `x`. So to find the common type of two
> factors, we take the union of the levels of each factor,
> giving a factor that has finer resolution than
> both. Finally, you can think of a character vector as a
> factor with every possible level, so factors and character
> vectors are coercible.
> (extracted from the in-progress vignette explaining how to
> extend vctrs to work with your own vectors, now that vctrs
> has been rewritten to use double dispatch)
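
The union rule described above can be sketched in base R. This is only a minimal illustration, not the vctrs implementation: `common_factor_type()` and `combine_factors()` are hypothetical helper names introduced here, and the zero-length factor mirrors the "common type as zero-length vector" convention from the quoted text. Because taking a union of level sets is associative, the pairwise helper extends to any number of inputs with `Reduce()`.

```r
# Hypothetical helper (not part of vctrs): the common "type" of two factors,
# represented as a zero-length factor whose levels are the union of both.
common_factor_type <- function(x, y) {
  stopifnot(is.factor(x), is.factor(y))
  factor(character(), levels = union(levels(x), levels(y)))
}

# Hypothetical helper: combine any number of factors under their common type.
combine_factors <- function(...) {
  inputs <- list(...)
  # Pairwise reduction: only two inputs are ever considered at a time.
  common <- Reduce(common_factor_type, inputs)
  values <- unlist(lapply(inputs, as.character))
  factor(values, levels = levels(common))
}

x <- factor("a")
y <- factor("b")
z <- factor(c("b", "c"))

str(common_factor_type(x, y))
combined <- combine_factors(x, y, z)
levels(combined)
```

Every level of each input is contained in the levels of the result, so in the terminology above the result has finer resolution than every input.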
I like this argumentation, and find it very nice indeed!
It confirms my own gut feeling, which had led me to agree
with you, Hadley, that taking the union of all factor levels
should be done here.
As Gabe mentioned (and you have commented on), the term "type"
is really confusing here. As you know, the R internals are all
about SEXPs, TYPEOF(), etc, and that's what the R level
typeof(.) also returns. Since you want to use the term for something
slightly different, it should have a different name, ideally one not
yet existing in the R / S world, maybe 'kind'?
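
To make the naming clash concrete, here is a small base R illustration (a sketch only, using nothing beyond base R): `typeof()` reports the internal storage type, which is not the notion of "type" discussed in this thread.

```r
f <- factor(c("a", "b"))
d <- as.Date("2018-08-08")

# typeof() reports the internal SEXP storage type ...
typeof(f)  # "integer"
typeof(d)  # "double"

# ... whereas the user-visible distinction lives in the class attribute.
class(f)   # "factor"
class(d)   # "Date"
```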
Martin
> Hadley
> --
> http://hadley.nz