[Rd] vctrs: a type system for the tidyverse

Joris Meys
Thu Aug 9 10:58:52 CEST 2018

I sent this to Iñaki personally by mistake. Thank you for notifying me.

On Wed, Aug 8, 2018 at 7:53 PM Iñaki Úcar <i.ucar86 using gmail.com> wrote:

> For what it's worth, I always thought about factors as fundamentally
> characters, but with restrictions: a subspace of all possible strings.
> And I'd say that a non-negligible number of R users may think about
> them in a similar way.

That idea has been a common source of bugs, and it is the most important
reason why I always explain to my students that factors are a special kind
of numeric (integer), not character. People coming from SPSS in particular
immediately see the link with categorical variables that way, and
understand that a factor is a modeling aid rather than an alternative to
characters. It is a categorical variable and a more readable way of
representing a set of dummy variables.
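
As a minimal sketch of that point, using only base R and the stats package
(the variable names `f` and `mm` are made up for illustration): a factor is
stored as integer codes with a levels attribute, and a model formula expands
it into dummy columns.

```r
# A factor is stored as integer codes plus a "levels" attribute.
f <- factor(c("low", "high", "low"), levels = c("low", "high"))

typeof(f)       # "integer"
as.integer(f)   # 1 2 1
levels(f)       # "low" "high"

# In a model, the factor expands into dummy (indicator) columns.
# With the default treatment contrasts, the first level is the baseline.
mm <- model.matrix(~ f)
colnames(mm)    # "(Intercept)" "fhigh"
```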

I do agree that some of the factor behaviour is confusing at best, but that
doesn't change the appropriate use and meaning of factors as categorical
variables.

Even more, I oppose the ideas that:

1) factors with different levels should be concatenated.

2) when combining factors, the union of the levels would somehow be a good
choice.

Factors with different levels are variables with different information, not
more or less information. If one factor codes low and high and another
codes low, mid and high, you can't say whether mid in the second factor
would be low or high in the first one. The second has a higher resolution,
and that's exactly the reason why they should NOT be combined. Different
levels indicate a different grouping, and hence that data should never be
used as one set of dummy variables in any model.

Even when combining factors, the union of levels only makes sense to me if
there's no overlap between levels of both factors. In all other cases, a
researcher will need to determine whether levels with the same label do
mean the same thing in both factors, and that's not guaranteed. And when
we're talking about one factor with a higher resolution and one with a
lower resolution, the correct thing to do modelwise is to recode one of the
factors so that both have the same resolution and every level has the same
definition before you merge that data.
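
A minimal sketch of that recoding step in base R. The mapping of "mid" onto
"low" is purely an assumption for illustration; which level it belongs to is
exactly the decision the researcher has to make. All variable names are
hypothetical.

```r
coarse <- factor(c("low", "high", "high"), levels = c("low", "high"))
fine   <- factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))

# Assumption for illustration only: the researcher decided that
# "mid" belongs with "low" in the coarser coding.
fine_recoded <- fine
levels(fine_recoded) <- list(low = c("low", "mid"), high = "high")

# Both factors now share the same levels in the same order.
identical(levels(coarse), levels(fine_recoded))   # TRUE

# Merge explicitly via the labels, so the result does not depend on
# how c() treats factors in a given R version.
merged <- factor(
  c(as.character(coarse), as.character(fine_recoded)),
  levels = levels(coarse)
)
```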

So imho the combination of two factors with different levels (or even
levels in a different order) should give an error. R currently doesn't
throw one, so I agree there's room for improvement.
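
To make the proposal concrete, here is a sketch of what such a strict
combination could look like. `combine_strict` is a hypothetical helper, not
a function from base R or vctrs, and it ignores ordered factors for brevity.
It avoids `c()` on factors because that behaviour has differed between R
versions.

```r
# Hypothetical helper (not part of base R or vctrs): combine two factors
# only if their levels are identical, including the order of the levels.
combine_strict <- function(x, y) {
  stopifnot(is.factor(x), is.factor(y))
  if (!identical(levels(x), levels(y))) {
    stop("factors have different levels (or a different level order)")
  }
  factor(c(as.character(x), as.character(y)), levels = levels(x))
}

a <- factor(c("low", "high"), levels = c("low", "high"))
b <- factor(c("high", "low"), levels = c("low", "high"))
d <- factor(c("low", "mid"),  levels = c("low", "mid", "high"))

ok <- combine_strict(a, b)   # same levels, same order: allowed

# Different levels: the helper signals an error, captured here as a message.
res <- tryCatch(combine_strict(a, d), error = function(e) conditionMessage(e))
```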

Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)

