[Rd] vctrs: a type system for the tidyverse
Iñaki Úcar
i@uc@r86 @ending from gm@il@com
Wed Aug 8 19:47:38 CEST 2018
On Wed, 8 Aug 2018 at 19:23, Gabe Becker (<becker.gabe using gene.com>) wrote:
>
> Actually, I sent that too quickly, I should have let it stew a bit more.
> I've changed my mind about the resolution argument I was trying to make.
> There is more information, technically speaking, in the factor with empty
> levels. I'm still not convinced that it's the right behavior, personally. It
> may just be me though, since Martin seems on board. Mostly I'm just very
> wary of taking away the thing about factors that makes them fundamentally
> not characters, and removing the effectiveness of the level restriction, in
> practice, does that.
For what it's worth, I always thought about factors as fundamentally
characters, but with restrictions: a subspace of all possible strings.
And I'd say that a non-negligible number of R users may think about
them in a similar way.
In fact, if you search "concatenation factors", you'll see that back
in 2008 somebody asked on R-help [1] because he wanted to do exactly
what Hadley is describing (i.e., concatenation as character with
levels as a union of the levels), and he was surprised because...
well, the behaviour of c.factor is quite surprising if you don't read
the manual.
BTW, the solution proposed was unlist(list(fct1, fct2)).
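A minimal sketch of that behaviour in base R, assuming two small made-up factors (fct1 and fct2 below are example values, and concat_union is a hypothetical helper name, not something from the R-help thread or from vctrs):

```r
# Hypothetical example factors whose level sets only partly overlap.
fct1 <- factor(c("a", "b"))
fct2 <- factor(c("b", "c"))

# The workaround proposed on R-help: wrap the factors in a list and unlist.
# unlist() treats a list of factors specially and returns a factor whose
# levels are the union of the input level sets.
combined <- unlist(list(fct1, fct2))

# Hypothetical helper spelling the same idea out explicitly:
# concatenate as character, then use the union of the levels.
concat_union <- function(x, y) {
  factor(c(as.character(x), as.character(y)),
         levels = union(levels(x), levels(y)))
}
combined2 <- concat_union(fct1, fct2)

levels(combined)   # "a" "b" "c"
levels(combined2)  # "a" "b" "c"
```

Both routes keep the result a factor, so the values stay restricted to the (now larger) set of levels.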
[1] https://www.mail-archive.com/r-help@r-project.org/msg38360.html
Iñaki
>
> Best,
> ~G
>
> On Wed, Aug 8, 2018 at 8:54 AM, Martin Maechler <maechler using stat.math.ethz.ch>
> wrote:
>
> > >>>>> Hadley Wickham
> > >>>>> on Wed, 8 Aug 2018 09:34:42 -0500 writes:
> >
> > >>>> Method dispatch for `vec_c()` is quite simple because
> > >>>> associativity and commutativity mean that we can
> > >>>> determine the output type only by considering a pair of
> > >>>> inputs at a time. To this end, vctrs provides
> > >>>> `vec_type2()` which takes two inputs and returns their
> > >>>> common type (represented as zero length vector):
> > >>>>
> > >>>> str(vec_type2(integer(), double()))
> > >>>> #> num(0)
> > >>>>
> > >>>> str(vec_type2(factor("a"), factor("b")))
> > >>>> #> Factor w/ 2 levels "a","b":
> > >>>
> > >>>
> > >>> What is the reasoning behind taking the union of the
> > >>> levels here? I'm not sure that is actually the behavior
> > >>> I would want if I have a vector of factors and I try to
> > >>> append some new data to it. I might want/ expect to
> > >>> retain the existing levels and get either NAs or an
> > >>> error if the new data has (present) levels not in the
> > >>> first data. The behavior as above doesn't seem in-line
> > >>> with what I understand the purpose of factors to be
> > >>> (explicit restriction of possible values).
> > >>
> > >> Originally (like a week ago), we threw an error if the
> > >> factors didn't have the same levels, and provided an
> > >> optional coercion to character. I decided that while
> > >> correct (the factor levels are a parameter of the type,
> > >> and hence factors with different levels aren't
> > >> comparable), that this fights too much against how people
> > >> actually use factors in practice. It also seems like base
> > >> R is moving more in this direction, i.e. in 3.4
> > >> factor("a") == factor("b") is an error, whereas in R 3.5
> > >> it returns FALSE.
> >
> > > I now have a better argument, I think:
> >
> > > If you squint your brain a little, I think you can see
> > > that each set of automatic coercions is about increasing
> > > resolution. Integers are low resolution versions of
> > > doubles, and dates are low resolution versions of
> > > date-times. Logicals are low resolution versions of
> > > integers because there's a strong convention that `TRUE`
> > > and `FALSE` can be used interchangeably with `1` and `0`.
> >
> > > But what is the resolution of a factor? We must take a
> > > somewhat pragmatic approach because base R often converts
> > > character vectors to factors, and we don't want to be
> > > burdensome to users. So we say that a factor `x` has finer
> > > resolution than factor `y` if the levels of `y` are
> > > contained in `x`. So to find the common type of two
> > > factors, we take the union of the levels of each factor,
> > > giving a factor that has finer resolution than
> > > both. Finally, you can think of a character vector as a
> > > factor with every possible level, so factors and character
> > > vectors are coercible.
> >
> > > (extracted from the in-progress vignette explaining how to
> > > extend vctrs to work with your own vectors, now that vctrs
> > > has been rewritten to use double dispatch)
> >
> > I like this argumentation, and find it very nice indeed!
> > It confirms my own gut feeling which had led me to agreeing
> > with you, Hadley, that taking the union of all factor levels
> > should be done here.
> >
> > As Gabe mentioned (and as you've explained), the term "type"
> > is really confusing here. As you know, the R internals are all
> > about SEXPs, TYPEOF(), etc, and that's what the R level
> > typeof(.) also returns. As you want to use something slightly
> > different, it should be different naming, ideally something not
> > existing yet in the R / S world, maybe 'kind' ?
> >
> > Martin
> >
> >
> > > Hadley
> >
> > > --
> > > http://hadley.nz
> >
> > > ______________________________________________
> > > R-devel using r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> >
>
>
> --
> Gabriel Becker, Ph.D
> Scientist
> Bioinformatics and Computational Biology
> Genentech Research
>