[Rd] stringsAsFactors

Michael Dewey info at aghmed.fsnet.co.uk
Wed Feb 13 13:33:20 CET 2013


At 18:01 11/02/2013, Ista Zahn wrote:
>FWIW my view is that for data cleaning and organizing factors just get
>in the way. For modeling I like them because they make it easier to
>understand what is happening. For example I can look at the levels()
>to see what the reference group will be. With characters one has to
>know a) that levels are created in alphabetical order and b) the
>alphabetical order of the unique values in the character vector.
>Ugh. So my habit is to turn off stringsAsFactors, then explicitly
>convert to factors before modeling (I also use factors to change the
>order in which things are displayed in tables and graphs, another
>place where converting to factors myself is useful but creating
>them in alphabetical order by default is not).
>
>All this is to say that I would like options(stringsAsFactors=FALSE) to
>be the default, but I like the warning about converting to factors in
>modeling functions because it reminds me that I forgot to convert them,
>which I like to do anyway...
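
To make the reference-group point concrete, here is a minimal sketch. The data and the names d, group, f_default, f_explicit and f_relevel are invented for illustration and are not taken from the thread.

```r
# Hypothetical data: a character column that will be used in a model.
d <- data.frame(group = c("treated", "control", "placebo", "control"),
                stringsAsFactors = FALSE)

# Default conversion sorts the unique values, so "control" comes first
# and would be the reference group under treatment contrasts.
f_default <- factor(d$group)
levels(f_default)

# Explicit conversion: the level order (and so the reference group and
# the display order in tables and graphs) is chosen by the analyst.
f_explicit <- factor(d$group, levels = c("placebo", "control", "treated"))
levels(f_explicit)

# relevel() moves one level to the front and keeps the rest in order.
f_relevel <- relevel(f_default, ref = "placebo")
levels(f_relevel)
```

Checking levels() before fitting is what makes the reference group visible; with a plain character vector the same ordering is applied silently.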

I seem to be one of the few people who find the current default 
helpful. When I read in a dataset I am nearly always going to follow 
it with one or more of the modelling functions and so I do want to 
treat the categorical variables as factors. I cannot off-hand think 
of an example where I have had to convert them to characters.

Incidentally xkcd has, while this discussion has been going on, 
posted something relevant
http://www.xkcd.com/1172/



>Best,
>Ista
>
>On Mon, Feb 11, 2013 at 12:50 PM, Duncan Murdoch
><murdoch.duncan at gmail.com> wrote:
> > On 11/02/2013 12:13 PM, William Dunlap wrote:
> >>
> >> Note that changing this does not just mean getting rid of "silly
> >> warnings".
> >> Currently, predict.lm() can give wrong answers when stringsAsFactors is
> >> FALSE.
> >>
> >>    > d <- data.frame(x=1:10, f=rep(c("A","B","C"), c(4,3,3)),
> >>    +                 y=c(1:4, 15:17, 28.1,28.8,30.1))
> >>    > fit_ab <- lm(y ~ x + f, data = d, subset = f!="B")
> >>    Warning message:
> >>    In model.matrix.default(mt, mf, contrasts) :
> >>      variable 'f' converted to a factor
> >>    > predict(fit_ab, newdata=d)
> >>     1  2  3  4  5  6  7  8  9 10
> >>     1  2  3  4 25 26 27  8  9 10
> >>    Warning messages:
> >>    1: In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
> >>      variable 'f' converted to a factor
> >>    2: In predict.lm(fit_ab, newdata = d) :
> >>      prediction from a rank-deficient fit may be misleading
> >>
> >> fit_ab is not rank-deficient and the predict should report
> >>     1 2 3 4 NA NA NA 28 29 30
> >
> >
> > In R-devel, the two warnings about factor conversions are no longer given,
> > but the predictions are the same and the warning about rank deficiency
> > still shows up.  If f is set to be a factor, an error is generated:
> >
> > Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
> > object$xlevels) :
> >   factor f has new levels B
> >
> > I think both the warning and error are somewhat reasonable responses.  The
> > fit is rank deficient relative to the model that includes f == "B",
> > because the column of the design matrix corresponding to f level B would
> > be completely zero.  In this particular model, we could still do predictions
> > for the other levels, but it also seems reasonable to quit, given that
> > clearly something has gone wrong.
> >
> > I do think that it's unfortunate that we don't get the same result in both
> > cases, and I'd like to have gotten the predictions you suggested, but I
> > don't think that's going to happen.  The reason for the difference is that
> > the subsetting is done before the conversion to a factor, but I think that
> > is unavoidable without really big changes.
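
A minimal sketch of the factor case described above, using the data from Bill's example with f converted to a factor explicitly. The last step, masking unseen levels as NA before calling predict(), is an assumed workaround and not something proposed in the thread; the names fit, newd and p are illustrative.

```r
# Data from the example above, with f stored as a factor explicitly.
d <- data.frame(x = 1:10,
                f = factor(rep(c("A", "B", "C"), c(4, 3, 3))),
                y = c(1:4, 15:17, 28.1, 28.8, 30.1))

# The subset is applied first and unused levels are then dropped,
# so the fit only knows about levels A and C.
fit <- lm(y ~ x + f, data = d, subset = f != "B")
fit$xlevels$f

# predict(fit, newdata = d) would stop with "factor f has new levels B".
# Assumed workaround: mark levels the fit has not seen as NA first,
# so those rows come back as NA instead of raising an error.
newd <- d
newd$f[!(newd$f %in% fit$xlevels$f)] <- NA
p <- predict(fit, newdata = newd)
p
```

This relies on predict.lm() passing missing values through by default (na.action = na.pass), so the rows with f == "B" are reported as NA while the A and C rows are predicted normally.
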
> >
> > Duncan Murdoch
> >
> >
> >
> >>
> >> Bill Dunlap
> >> Spotfire, TIBCO Software
> >> wdunlap tibco.com
> >>
> >> > -----Original Message-----
> >> > From: r-devel-bounces at r-project.org
> >> > [mailto:r-devel-bounces at r-project.org] On Behalf
> >> > Of Terry Therneau
> >> > Sent: Monday, February 11, 2013 5:50 AM
> >> > To: r-devel at r-project.org; Duncan Murdoch
> >> > Subject: Re: [Rd] stringsAsFactors
> >> >
> >> > I think your idea to remove the warnings is excellent, and a good
> >> > compromise.  Characters already work fine in modeling functions except
> >> > for the silly warning.
> >> >
> >> > It is interesting how often the defaults for a program reflect the data
> >> > sets in use at the time the defaults were chosen.  There are some such in
> >> > my own survival package whose proper value is no longer as "obvious" as
> >> > it was when I chose them.  Factors are very handy for variables which
> >> > have only a few levels and will be used in modeling.  Every character
> >> > variable of every dataset in "Statistical Models in S", which introduced
> >> > factors, is of this type so auto-transformation made a lot of sense.  The
> >> > "solder" data set there is one for which Helmert contrasts are proper so
> >> > guess what the default contrast option was?  (I think there are only a
> >> > few data sets in the world for which Helmert makes sense, however, and R
> >> > eventually changed the default.)
> >> >
> >> > For character variables that should not be factors, such as a street
> >> > address, stringsAsFactors can be a real PITA, and I expect that people's
> >> > preference for the option depends almost entirely on how often these
> >> > arise in their own work.  As long as there is an option that can be
> >> > overridden I'm okay.  Yes, I'd prefer FALSE as the default, partly
> >> > because the current value is a tripwire in the hallway that eventually
> >> > catches every new user.
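
A small sketch of the street-address case, setting the argument explicitly so the result does not depend on the session default. The helper make_df() and all of the records are hypothetical, invented only for illustration.

```r
# Hypothetical records; ids, addresses and arms are invented.
make_df <- function(as_factors) {
  data.frame(id = 1:3,
             address = c("12 Elm St", "7 Oak Ave", "3 Pine Rd"),
             arm = c("A", "B", "A"),
             stringsAsFactors = as_factors)
}

d_fac <- make_df(TRUE)   # every character column becomes a factor
d_chr <- make_df(FALSE)  # character columns stay character

class(d_fac$address)
class(d_chr$address)

# With the option off, convert only the variable intended for modelling.
d_chr$arm <- factor(d_chr$arm)
sapply(d_chr, class)
```

With stringsAsFactors = TRUE the address column becomes a factor with one level per distinct address, which is rarely what is wanted; converting selectively keeps free-text columns as character.
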
> >> >
> >> > Terry Therneau
> >> >
> >> > On 02/11/2013 05:00 AM, r-devel-request at r-project.org wrote:
> >> > > Both of these were discussed by R Core.  I think it's unlikely the
> >> > > default for stringsAsFactors will be changed (some R Core members like
> >> > > the current behaviour), but it's fairly likely the show.signif.stars
> >> > > default will change.  (That's if someone gets around to it:  I
> >> > > personally don't care about that one.  P-values are commonly used
> >> > > statistics, and the stars are just a simple graphical display of them.
> >> > > I find some p-values to be useful, and the display to be harmless.)
> >> > >
> >> > > I think it's really unlikely the more extreme changes (i.e. dropping
> >> > > show.signif.stars completely, or dropping p-values) will happen.
> >> > >
> >> > > Regarding stringsAsFactors:  I'm not going to defend keeping it as is,
> >> > > I'll let the people who like it defend it.  What I will likely do is
> >> > > make a few changes so that character vectors are automatically changed
> >> > > to factors in modelling functions, so that operating with
> >> > > stringsAsFactors=FALSE doesn't trigger silly warnings.
> >> >
> >> > ______________________________________________
> >> > R-devel at r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-devel

Michael Dewey
info at aghmed.fsnet.co.uk
http://www.aghmed.fsnet.co.uk/home.html


