[Rd] Regression stars

Duncan Murdoch murdoch.duncan at gmail.com
Wed Feb 13 14:17:44 CET 2013

On 13-02-13 7:25 AM, peter dalgaard wrote:
> On Feb 12, 2013, at 20:19 , Duncan Murdoch wrote:
>> I think you are misreading what Peter wrote.  He wasn't defending
>> that point of view, he was describing it.
> Yes. However, that being said, there is the point that the whole
> thing has been designed to work within the paradigm that I described,
> and, for better or worse, things are reasonably coherent and
> consistent within that framework.
> The thing that always worries me, when people get bothered by some
> aspect of software design, is that, if you change only that aspect,
> you may find yourself with something that is incoherent and
> inconsistent. I have quite a few times found myself realizing that
> "Uncle John was right after all".
> For instance, if you change the paradigm to say that "character
> variables are character, unless explicitly turned into factors", and
> then ameliorate the inconvenience by changing code that relies on
> factors to convert character variables on the fly, then you will lose
> the otherwise automatic consistency of level sets between subsets of
> data. (So, the math department not only has zero female professors,
> the entire female gender ceases to exist for that subgroup.)
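
A small illustration of the level-set point quoted above, using made-up data (the `staff` data frame and its column names are hypothetical, chosen only for this sketch). Subsetting a factor keeps the full level set, so an empty category still shows up with a zero count; the same subset held as character has no record that the category ever existed.

```r
# Hypothetical data: two departments, Sex stored as a factor.
staff <- data.frame(
  Dept = c("Math", "Math", "Stats", "Stats"),
  Sex  = factor(c("M", "M", "F", "M"), levels = c("F", "M"))
)

math <- staff[staff$Dept == "Math", ]

# Factor column: the level set survives subsetting, so F is tabulated as 0.
by_factor <- table(math$Sex)

# Character version of the same column: F is simply absent.
by_char <- table(as.character(math$Sex))

print(by_factor)
print(by_char)
```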

Sure, if I have a file that contains a column named Sex and it is all M,
I can't expect R to automatically know that there is another
possibility.  That's always been a problem.  If we automatically convert
the data to factors when we read, then maybe we'll be lucky and some
other part of that file that we're planning to throw away will contain
an F, and we'll automatically construct the right factor.
(Except we don't:  lm() and glm() will throw away the F level if there are
none in the subset we pass to them, factor or not, because they use
drop.unused.levels = TRUE in their call to model.frame().)
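
The effect of drop.unused.levels can be seen by calling model.frame() directly. This is a minimal sketch on invented data (the `staff` data frame and the `Salary` column are assumptions for illustration); it calls model.frame() rather than lm() because a one-level factor cannot be given contrasts in a fit.

```r
# Hypothetical data: the Math subset contains no F, but the factor knows about F.
staff <- data.frame(
  Dept   = c("Math", "Math", "Stats", "Stats"),
  Sex    = factor(c("M", "M", "F", "M"), levels = c("F", "M")),
  Salary = c(50, 55, 60, 58)
)
math <- staff[staff$Dept == "Math", ]

# Default for model.frame(): unused levels are kept.
mf_keep <- model.frame(Salary ~ Sex, data = math)

# What lm() and glm() ask for: unused levels are dropped.
mf_drop <- model.frame(Salary ~ Sex, data = math, drop.unused.levels = TRUE)

levels(mf_keep$Sex)
levels(mf_drop$Sex)
```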

There's also the possibility that there will be m and f in there, and
we'll get it wrong.

In R 2.15.2, we do the automatic conversion with a warning, but we do it
wrong, which leads to the inconsistency that Bill Dunlap reported.
R-devel drops the warning and comes closer to getting it right, but it's
really an impossible problem:  if we never see an F, we'll never set the
levels of the factor properly.  If we see a typo like m or f and don't
realize it's a typo, we'll have more than two Sex values.
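
Both failure modes are easy to reproduce with factor() on invented values. The sketch below also shows one possible guard, which is a suggestion here and not something the conversion code does: supplying the expected levels explicitly, so that a typo becomes NA and not a new level.

```r
# Hypothetical raw column containing a lower-case typo.
sex_raw <- c("M", "F", "m", "M")

# Automatic conversion: every distinct string becomes a level.
auto <- factor(sex_raw)
nlevels(auto)

# Explicit levels: values outside the set turn into NA and can be inspected.
strict <- factor(sex_raw, levels = c("F", "M"))
which(is.na(strict))

# If no F is ever seen, automatic conversion cannot know the level exists.
only_m <- factor(c("M", "M"))
levels(only_m)
```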

The current R-devel implementation delays the conversion as much as it
can, and maybe it delays it too far.  It allows model.frame() to
continue to return character columns, as it does in 2.15.2.  This was to
support xtabs(), which treats character columns differently from
factors, and other unforeseen uses.  Another possibility would be to add
an argument ("stringsAsFactors"?) to model.frame() to let modelling
functions choose whether they want factors or not.  xtabs() would say
no, lm() and glm() would say yes.  I think the current implementation is
preferable because it won't require changes to well-written existing code.
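
To make the alternative concrete, here is a rough sketch of what such an option might look like if written as a wrapper. The function `model_frame_saf()` and its `stringsAsFactors` argument are hypothetical; model.frame() itself has no such argument, and this is not how R-devel implements anything. It assumes that model.frame() returns character columns unchanged, as described above.

```r
# Hypothetical wrapper, not part of R: lets the caller choose whether
# character columns in the model frame are converted to factors.
model_frame_saf <- function(formula, data, stringsAsFactors = TRUE) {
  mf <- model.frame(formula, data = data)
  if (stringsAsFactors) {
    is_chr <- vapply(mf, is.character, logical(1))
    if (any(is_chr)) {
      mf[is_chr] <- lapply(mf[is_chr], factor)
    }
  }
  mf
}

# Invented data with Sex held as character.
staff <- data.frame(
  Sex    = c("M", "F", "M", "F"),
  Salary = c(50, 60, 55, 58),
  stringsAsFactors = FALSE
)

# An lm()-like caller would ask for factors; an xtabs()-like caller would not.
mf_fac <- model_frame_saf(Salary ~ Sex, staff)
mf_chr <- model_frame_saf(Salary ~ Sex, staff, stringsAsFactors = FALSE)

class(mf_fac$Sex)
class(mf_chr$Sex)
```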

With the current R-devel implementation, it is easier than in 2.15.2 to
get errors thrown when the auto-conversion goes wrong.  I don't know of
any examples where you get incorrect results.  I think this is an
improvement.

I'd appreciate hearing of any bugs in it.

Duncan Murdoch
