[Rd] stringsAsFactors

Terry Therneau therneau at mayo.edu
Mon Feb 11 20:34:04 CET 2013


The root of this problem is that the .getXlevels function does not return the levels for
character variables, and future predictions depend on that information.
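
A minimal sketch of what I mean, reusing Bill's data from the quoted message below. The names d_chr, d_fac, fit_chr and fit_fac are mine, made up for illustration; everything else is plain base R and stats. What the character fit stores in xlevels depends on the R version, so I make no claim about that line's output. The factor fit is the case where the levels are recorded, and that is what lets predict() notice the unseen level.

```r
# Bill's example data, built once with f as character and once as a factor.
d_chr <- data.frame(x = 1:10,
                    f = rep(c("A", "B", "C"), c(4, 3, 3)),
                    y = c(1:4, 15:17, 28.1, 28.8, 30.1),
                    stringsAsFactors = FALSE)
d_fac <- transform(d_chr, f = factor(f))

# Same model, same subset, fit to each version of the data.
fit_chr <- lm(y ~ x + f, data = d_chr, subset = f != "B")
fit_fac <- lm(y ~ x + f, data = d_fac, subset = f != "B")

# The xlevels component holds whatever .getXlevels() returned at fit time.
# For the character fit, the content depends on the R version in use.
str(fit_chr$xlevels)

# For the factor fit, the levels present in the subset are recorded.
str(fit_fac$xlevels)

# With the levels recorded, predict() can detect the unseen level "B"
# and stops with an error instead of returning numbers.
res <- tryCatch(predict(fit_fac, newdata = d_fac),
                error = function(e) conditionMessage(e))
print(res)
```

Converting to a factor before the fit is only a workaround for the user; it does not address what .getXlevels itself returns.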

On 02/11/2013 11:50 AM, Duncan Murdoch wrote:
> On 11/02/2013 12:13 PM, William Dunlap wrote:
>> Note that changing this does not just mean getting rid of "silly warnings".
>> Currently, predict.lm() can give wrong answers when stringsAsFactors is FALSE.
>>
>> > d <- data.frame(x=1:10, f=rep(c("A","B","C"), c(4,3,3)),
>> +                 y=c(1:4, 15:17, 28.1,28.8,30.1))
>> > fit_ab <- lm(y ~ x + f, data = d, subset = f!="B")
>>    Warning message:
>>    In model.matrix.default(mt, mf, contrasts) :
>>      variable 'f' converted to a factor
>> > predict(fit_ab, newdata=d)
>>     1  2  3  4  5  6  7  8  9 10
>>     1  2  3  4 25 26 27  8  9 10
>>    Warning messages:
>>    1: In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
>>      variable 'f' converted to a factor
>>    2: In predict.lm(fit_ab, newdata = d) :
>>      prediction from a rank-deficient fit may be misleading
>>
>> fit_ab is not rank-deficient and the predict should report
>>     1 2 3 4 NA NA NA 28 29 30
>
> In R-devel, the two warnings about factor conversions are no longer given, but the 
> predictions are the same and the warning about rank deficiency still shows up.  If f is 
> set to be a factor, an error is generated:
>
> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = 
> object$xlevels) :
>   factor f has new levels B
>
> I think both the warning and error are somewhat reasonable responses.  The fit is rank 
> deficient relative to the model that includes f == "B",  because the column of the 
> design matrix corresponding to f level B would be completely zero.  In this particular 
> model, we could still do predictions for the other levels, but it also seems reasonable 
> to quit, given that clearly something has gone wrong.
>
> I do think that it's unfortunate that we don't get the same result in both cases, and 
> I'd like to have gotten the predictions you suggested, but I don't think that's going to 
> happen.  The reason for the difference is that the subsetting is done before the 
> conversion to a factor, but I think that is unavoidable without really big changes.
>
> Duncan Murdoch
>
>
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>> > -----Original Message-----
>> > From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf
>> > Of Terry Therneau
>> > Sent: Monday, February 11, 2013 5:50 AM
>> > To: r-devel at r-project.org; Duncan Murdoch
>> > Subject: Re: [Rd] stringsAsFactors
>> >
>> > I think your idea to remove the warnings is excellent, and a good compromise.
>> > Characters already work fine in modeling functions except for the silly warning.
>> >
>> > It is interesting how often the defaults for a program reflect the data sets in use
>> > at the time the defaults were chosen.  There are some such in my own survival package
>> > whose proper value is no longer as "obvious" as it was when I chose them.  Factors are
>> > very handy for variables which have only a few levels and will be used in modeling.
>> > Every character variable of every dataset in "Statistical Models in S", which
>> > introduced factors, is of this type so auto-transformation made a lot of sense.  The
>> > "solder" data set there is one for which Helmert contrasts are proper so guess what the
>> > default contrast option was?  (I think there are only a few data sets in the world for
>> > which Helmert makes sense, however, and R eventually changed the default.)
>> >
>> > For character variables that should not be factors, such as a street address,
>> > stringsAsFactors can be a real PITA, and I expect that people's preference for the
>> > option depends almost entirely on how often these arise in their own work.  As long as
>> > there is an option that can be overridden I'm okay.  Yes, I'd prefer FALSE as the
>> > default, partly because the current value is a tripwire in the hallway that eventually
>> > catches every new user.
>> >
>> > Terry Therneau
>> >
>> > On 02/11/2013 05:00 AM, r-devel-request at r-project.org wrote:
>> > > Both of these were discussed by R Core.  I think it's unlikely the
>> > > default for stringsAsFactors will be changed (some R Core members like
>> > > the current behaviour), but it's fairly likely the show.signif.stars
>> > > default will change.  (That's if someone gets around to it:  I
>> > > personally don't care about that one.  P-values are commonly used
>> > > statistics, and the stars are just a simple graphical display of them.
>> > > I find some p-values to be useful, and the display to be harmless.)
>> > >
>> > > I think it's really unlikely the more extreme changes (i.e. dropping
>> > > show.signif.stars completely, or dropping p-values) will happen.
>> > >
>> > > Regarding stringsAsFactors:  I'm not going to defend keeping it as is,
>> > > I'll let the people who like it defend it.  What I will likely do is
>> > > make a few changes so that character vectors are automatically changed
>> > > to factors in modelling functions, so that operating with
>> > > stringsAsFactors=FALSE doesn't trigger silly warnings.
>> >
>


