[Rd] stringsAsFactors

Mon Feb 11 18:13:09 CET 2013

Note that changing this does not just mean getting rid of "silly warnings".
Currently, predict.lm() can give wrong answers when stringsAsFactors is FALSE.

  > d <- data.frame(x=1:10, f=rep(c("A","B","C"), c(4,3,3)), y=c(1:4, 15:17, 28.1,28.8,30.1))
  > fit_ab <- lm(y ~ x + f, data = d, subset = f!="B")
  Warning message:
  In model.matrix.default(mt, mf, contrasts) :
    variable 'f' converted to a factor
  > predict(fit_ab, newdata=d)
   1  2  3  4  5  6  7  8  9 10
   1  2  3  4 25 26 27  8  9 10
  Warning messages:
  1: In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
    variable 'f' converted to a factor
  2: In predict.lm(fit_ab, newdata = d) :
    prediction from a rank-deficient fit may be misleading

fit_ab is not rank-deficient, and predict() should instead report
   1 2 3 4 NA NA NA 28 29 30 
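
In the meantime, one way to get NAs for the unseen level is to make the factor levels explicit yourself. The following is only a sketch of a workaround, not a description of what predict.lm() does internally. The names train, newd and pred are introduced here for illustration. It relies on predict.lm() using na.action = na.pass by default.

```r
# Same data as above, with the character column left as character.
d <- data.frame(x = 1:10,
                f = rep(c("A", "B", "C"), c(4, 3, 3)),
                y = c(1:4, 15:17, 28.1, 28.8, 30.1),
                stringsAsFactors = FALSE)

# Fit on the rows without "B", converting f to a factor explicitly
# so that the fit records exactly the levels it has seen (A and C).
train <- subset(d, f != "B")
train$f <- factor(train$f)
fit_ab <- lm(y ~ x + f, data = train)

# Recode f in the new data using the training levels only;
# any value outside those levels ("B" here) becomes NA.
newd <- d
newd$f <- factor(newd$f, levels = levels(train$f))

# With the default na.action = na.pass, rows whose f is NA
# get an NA prediction (positions 5:7 here).
pred <- predict(fit_ab, newdata = newd)
print(pred)
```

This avoids both the conversion warning and the misleading rank-deficiency warning, at the cost of doing the level bookkeeping by hand.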

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf
> Of Terry Therneau
> Sent: Monday, February 11, 2013 5:50 AM
> To: r-devel at r-project.org; Duncan Murdoch
> Subject: Re: [Rd] stringsAsFactors
> 
> I think your idea to remove the warnings is excellent, and a good compromise.  Characters
> already work fine in modeling functions except for the silly warning.
> 
> It is interesting how often the defaults for a program reflect the data sets in use at the
> time the defaults were chosen.  There are some such in my own survival package whose
> proper value is no longer as "obvious" as it was when I chose them.  Factors are very
> handy for variables which have only a few levels and will be used in modeling.  Every
> character variable of every dataset in "Statistical Models in S", which introduced
> factors, is of this type so auto-transformation made a lot of sense.  The "solder" data
> set there is one for which Helmert contrasts are proper so guess what the default contrast
> option was?  (I think there are only a few data sets in the world for which Helmert makes
> sense, however, and R eventually changed the default.)
> 
> For character variables that should not be factors, such as a street address,
> stringsAsFactors can be a real PITA, and I expect that people's preference for the option
> depends almost entirely on how often these arise in their own work.  As long as there is
> an option that can be overridden I'm okay.  Yes, I'd prefer FALSE as the default, partly
> because the current value is a tripwire in the hallway that eventually catches every new
> user.
> 
> Terry Therneau
> 
> On 02/11/2013 05:00 AM, r-devel-request at r-project.org wrote:
> > Both of these were discussed by R Core.  I think it's unlikely the
> > default for stringsAsFactors will be changed (some R Core members like
> > the current behaviour), but it's fairly likely the show.signif.stars
> > default will change.  (That's if someone gets around to it:  I
> > personally don't care about that one.  P-values are commonly used
> > statistics, and the stars are just a simple graphical display of them.
> > I find some p-values to be useful, and the display to be harmless.)
> >
> > I think it's really unlikely the more extreme changes (i.e. dropping
> > show.signif.stars completely, or dropping p-values) will happen.
> >
> > Regarding stringsAsFactors:  I'm not going to defend keeping it as is,
> > I'll let the people who like it defend it.  What I will likely do is
> > make a few changes so that character vectors are automatically changed
> > to factors in modelling functions, so that operating with
> > stringsAsFactors=FALSE doesn't trigger silly warnings.
> 