[Rd] stringsAsFactors

Duncan Murdoch murdoch.duncan at gmail.com
Wed Feb 13 14:27:30 CET 2013

On 13-02-13 7:33 AM, Michael Dewey wrote:
> At 18:01 11/02/2013, Ista Zahn wrote:
>> FWIW my view is that for data cleaning and organizing factors just get
>> in the way. For modeling I like them because they make it easier to
>> understand what is happening. For example I can look at the levels()
>> to see what the reference group will be. With characters one has to
>> know a) that levels are created in alphabetical order and b) the
>> alphabetical order of the unique values in the character vector.
>> Ugh. So my habit is to turn off stringsAsFactors, then explicitly
>> convert to factors before modeling (I also use factors to change the
>> order in which things are displayed in tables and graphs, another
>> place where converting to factors myself is useful but creating
>> them in alphabetical order by default is not).
>> All this is to say that I would like options(stringsAsFactors=FALSE) to
>> be the default, but I like the warning about converting to factors in
>> modeling functions because it reminds me that I forgot to convert them,
>> which I like to do anyway...
> I seem to be one of the few people who find the current default
> helpful. When I read in a dataset I am nearly always going to follow
> it with one or more of the modelling functions and so I do want to
> treat the categorical variables as factors. I cannot off-hand think
> of an example where I have had to convert them to characters.

Please try out the current R-devel (revision 61928 or newer) and let me 
know if anything in your current workflow gets broken by the recent changes.
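
For anyone unsure what to test, here is a minimal sketch (an illustration 
only, not part of the R-devel changes themselves) that reruns Bill Dunlap's 
example from further down the thread with f as a character vector and as a 
factor.  The names d_chr, d_fac and try_predict() are made up for the 
example, and what gets printed will depend on the R version, so no output 
is shown.

```r
# Sketch only: rerun Bill Dunlap's example with f as character and as factor.
# d_chr, d_fac and try_predict() are hypothetical names used for illustration.
d_chr <- data.frame(x = 1:10,
                    f = rep(c("A", "B", "C"), c(4, 3, 3)),
                    y = c(1:4, 15:17, 28.1, 28.8, 30.1),
                    stringsAsFactors = FALSE)  # keep f as character
d_fac <- transform(d_chr, f = factor(f))       # same data, f converted explicitly

# Fit on the subset without "B", then predict on the full data.
# An error is caught and returned as its message so both cases can be compared;
# warnings are suppressed here to keep the comparison short.
try_predict <- function(d) {
  fit <- lm(y ~ x + f, data = d, subset = f != "B")
  tryCatch(suppressWarnings(predict(fit, newdata = d)),
           error = function(e) conditionMessage(e))
}

pred_chr <- try_predict(d_chr)
pred_fac <- try_predict(d_fac)
print(pred_chr)
print(pred_fac)
```

If the two results differ from what you get in the released version, that 
is the kind of thing I would like to hear about.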

Duncan Murdoch

> Incidentally xkcd has, while this discussion has been going on,
> posted something relevant
> http://www.xkcd.com/1172/
>> Best,
>> Ista
>> On Mon, Feb 11, 2013 at 12:50 PM, Duncan Murdoch
>> <murdoch.duncan at gmail.com> wrote:
>>> On 11/02/2013 12:13 PM, William Dunlap wrote:
>>>> Note that changing this does not just mean getting rid of "silly
>>>> warnings".
>>>> Currently, predict.lm() can give wrong answers when stringsAsFactors is
>>>> FALSE.
>>>>     > d <- data.frame(x=1:10, f=rep(c("A","B","C"), c(4,3,3)),
>>>>     +                 y=c(1:4, 15:17, 28.1,28.8,30.1))
>>>>     > fit_ab <- lm(y ~ x + f, data = d, subset = f!="B")
>>>>     Warning message:
>>>>     In model.matrix.default(mt, mf, contrasts) :
>>>>       variable 'f' converted to a factor
>>>>     > predict(fit_ab, newdata=d)
>>>>      1  2  3  4  5  6  7  8  9 10
>>>>      1  2  3  4 25 26 27  8  9 10
>>>>     Warning messages:
>>>>     1: In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
>>>>       variable 'f' converted to a factor
>>>>     2: In predict.lm(fit_ab, newdata = d) :
>>>>       prediction from a rank-deficient fit may be misleading
>>>> fit_ab is not rank-deficient and the predict should report
>>>>      1 2 3 4 NA NA NA 28 29 30
>>> In R-devel, the two warnings about factor conversions are no longer given,
>>> but the predictions are the same and the warning about rank deficiency still
>>> shows up.  If f is set to be a factor, an error is generated:
>>> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
>>> object$xlevels) :
>>>    factor f has new levels B
>>> I think both the warning and error are somewhat reasonable responses.  The
>>> fit is rank deficient relative to the model that includes f == "B", because
>>> the column of the design matrix corresponding to f level B would be
>>> completely zero.  In this particular model, we could still do predictions
>>> for the other levels, but it also seems reasonable to quit, given that
>>> clearly something has gone wrong.
>>> I do think that it's unfortunate that we don't get the same result in both
>>> cases, and I'd like to have gotten the predictions you suggested, but I
>>> don't think that's going to happen.  The reason for the difference is that
>>> the subsetting is done before the conversion to a factor, but I think that
>>> is unavoidable without really big changes.
>>> Duncan Murdoch
>>>> Bill Dunlap
>>>> Spotfire, TIBCO Software
>>>> wdunlap tibco.com
>>>>> -----Original Message-----
>>>>> From: r-devel-bounces at r-project.org
>>>>> [mailto:r-devel-bounces at r-project.org] On Behalf Of Terry Therneau
>>>>> Sent: Monday, February 11, 2013 5:50 AM
>>>>> To: r-devel at r-project.org; Duncan Murdoch
>>>>> Subject: Re: [Rd] stringsAsFactors
>>>>> I think your idea to remove the warnings is excellent, and a good
>>>>> compromise.  Characters already work fine in modeling functions except
>>>>> for the silly warning.
>>>>> It is interesting how often the defaults for a program reflect the data
>>>>> sets in use at the time the defaults were chosen.  There are some such in
>>>>> my own survival package whose proper value is no longer as "obvious" as it
>>>>> was when I chose them.  Factors are very handy for variables which have
>>>>> only a few levels and will be used in modeling.  Every character variable
>>>>> of every dataset in "Statistical Models in S", which introduced factors, is
>>>>> of this type so auto-transformation made a lot of sense.  The "solder" data
>>>>> set there is one for which Helmert contrasts are proper so guess what the
>>>>> default contrast option was?  (I think there are only a few data sets in
>>>>> the world for which Helmert makes sense, however, and R eventually changed
>>>>> the default.)
>>>>> For character variables that should not be factors, such as a street
>>>>> address, stringsAsFactors can be a real PITA, and I expect that people's
>>>>> preference for the option depends almost entirely on how often these arise
>>>>> in their own work.  As long as there is an option that can be overridden
>>>>> I'm okay.  Yes, I'd prefer FALSE as the default, partly because the current
>>>>> value is a tripwire in the hallway that eventually catches every new user.
>>>>> Terry Therneau
>>>>> On 02/11/2013 05:00 AM, r-devel-request at r-project.org wrote:
>>>>>> Both of these were discussed by R Core.  I think it's unlikely the
>>>>>> default for stringsAsFactors will be changed (some R Core members like
>>>>>> the current behaviour), but it's fairly likely the show.signif.stars
>>>>>> default will change.  (That's if someone gets around to it:  I
>>>>>> personally don't care about that one.  P-values are commonly used
>>>>>> statistics, and the stars are just a simple graphical display of them.
>>>>>> I find some p-values to be useful, and the display to be harmless.)
>>>>>> I think it's really unlikely the more extreme changes (i.e. dropping
>>>>>> show.signif.stars completely, or dropping p-values) will happen.
>>>>>> Regarding stringsAsFactors:  I'm not going to defend keeping it as is,
>>>>>> I'll let the people who like it defend it.  What I will likely do is
>>>>>> make a few changes so that character vectors are automatically changed
>>>>>> to factors in modelling functions, so that operating with
>>>>>> stringsAsFactors=FALSE doesn't trigger silly warnings.
>>>>> ______________________________________________
>>>>> R-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> Michael Dewey
> info at aghmed.fsnet.co.uk
> http://www.aghmed.fsnet.co.uk/home.html
