[Rd] vctrs: a type system for the tidyverse

Gabe Becker becker.gabe at gene.com
Wed Aug 8 17:56:00 CEST 2018


Hadley,

Responses inline.

On Wed, Aug 8, 2018 at 7:34 AM, Hadley Wickham <h.wickham using gmail.com> wrote:

> >>> Method dispatch for `vec_c()` is quite simple because associativity and
> >>> commutativity mean that we can determine the output type only by
> >>> considering a pair of inputs at a time. To this end, vctrs provides
> >>> `vec_type2()` which takes two inputs and returns their common type
> >>> (represented as zero length vector):
> >>>
> >>>     str(vec_type2(integer(), double()))
> >>>     #>  num(0)
> >>>
> >>>     str(vec_type2(factor("a"), factor("b")))
> >>>     #>  Factor w/ 2 levels "a","b":
> >>
> >>
> >> What is the reasoning behind taking the union of the levels here? I'm not
> >> sure that is actually the behavior I would want if I have a vector of
> >> factors and I try to append some new data to it. I might want/expect to
> >> retain the existing levels and get either NAs or an error if the new data
> >> has (present) levels not in the first data. The behavior as above doesn't
> >> seem in line with what I understand the purpose of factors to be (explicit
> >> restriction of possible values).
> >
> > Originally (like a week ago 😀), we threw an error if the factors
> > didn't have the same levels, and provided an optional coercion to
> > character. I decided that while correct (the factor levels are a
> > parameter of the type, and hence factors with different levels aren't
> > comparable), this fights too much against how people actually use
> > factors in practice. It also seems like base R is moving more in this
> > direction, i.e. in 3.4 factor("a") == factor("b") is an error, whereas
> > in R 3.5 it returns FALSE.
>
> I now have a better argument, I think:
>
> If you squint your brain a little, I think you can see that each set
> of automatic coercions is about increasing resolution. Integers are
> low resolution versions of doubles, and dates are low resolution
> versions of date-times. Logicals are low resolution versions of
> integers because there's a strong convention that `TRUE` and `FALSE`
> can be used interchangeably with `1` and `0`.
>
> But what is the resolution of a factor? We must take a somewhat
> pragmatic approach because base R often converts character vectors to
> factors, and we don't want to be burdensome to users.


I don't know, I personally just don't buy this line of reasoning. Yes, you
can convert between characters and factors, but that doesn't make factors
"a special kind of character", which you seem to be implicitly arguing they
are. Fundamentally they are different objects with different purposes. As I
said in my previous email, the primary semantic purpose of factors is value
restriction. You don't WANT to increase the set of levels when your set of
values has already been carefully curated. Certainly not automagically.
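
A minimal base R sketch of the value restriction described above. The data are made up for illustration, and only base R functions are used; nothing here is taken from vctrs.

```r
# Illustrative data: a factor whose levels have been curated to "a" and "b".
x <- factor(c("a", "b", "a"), levels = c("a", "b"))

# Assigning a value outside the curated levels does not extend them:
# the element becomes NA and R issues an "invalid factor level" warning.
x2 <- x
x2[2] <- "c"
levels(x2)    # still "a" "b"
is.na(x2[2])  # TRUE

# Encoding new data against the existing levels applies the same restriction.
new <- factor(c("a", "c"), levels = levels(x))
is.na(new)    # FALSE TRUE
```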


> So we say that a
> factor `x` has finer resolution than factor `y` if the levels of `y`
> are contained in `x`. So to find the common type of two factors, we
> take the union of the levels of each factor, giving a factor that has
> finer resolution than both.


I'm not so sure. I think a more useful definition of resolution may be that
it is about increasing the precision of information. In that case, a factor
with 4 levels each of which is present has a *higher* resolution than the
same data with additional-but-absent levels on the factor object. Now that
may be different when the new levels are not absent, but my point is
that it's not clear to me that resolution is a useful way of talking about
factors.
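
To make the absent-levels case concrete, here is a small base R sketch. The level names "e" and "f" are hypothetical extras added purely for illustration.

```r
# Four observed values, each of which is a level.
x <- factor(c("a", "b", "c", "d"))

# The same data, but with two additional levels that never occur.
y <- factor(c("a", "b", "c", "d"),
            levels = c("a", "b", "c", "d", "e", "f"))

# The observed values are identical...
all(as.character(x) == as.character(y))

# ...but y admits two values that carry no information about the data.
nlevels(x)
nlevels(y)
table(y)

# Dropping the unused levels recovers the level set of x.
identical(levels(droplevels(y)), levels(x))
```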


> Finally, you can think of a character
> vector as a factor with every possible level, so factors and character
> vectors are coercible.
>



If users want unrestricted character-type behavior, then IMHO they should
just be using characters, and it's quite easy for them to do so in any case
I can think of where they have somehow gotten their hands on a factor. If,
however, they want a factor, it must be - I imagine - because they actually
want the semantics and behavior *specific* to factors.
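
As a sketch of how cheap the opt-out is in base R (the two factors are invented for the example): converting explicitly to character before combining gives the unrestricted behavior without touching either set of levels.

```r
# Two illustrative factors with different level sets.
f <- factor(c("a", "b"))
g <- factor(c("b", "c"))

# Explicit conversion to character, then an ordinary combine.
combined <- c(as.character(f), as.character(g))
combined                # "a" "b" "b" "c"
is.character(combined)  # TRUE

# The original factors and their levels are left as they were.
levels(f)               # "a" "b"
```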

Best,
~G


>
> (extracted from the in-progress vignette explaining how to extend
> vctrs to work with your own vctrs, now that vctrs has been rewritten
> to use double dispatch)
>
> Hadley
>
> --
> http://hadley.nz
>



-- 
Gabriel Becker, Ph.D
Scientist
Bioinformatics and Computational Biology
Genentech Research
