therneau at mayo.edu
Mon Feb 11 20:34:04 CET 2013
The root of this problem is that the .getXlevels function does not return the levels for character variables.
Future predictions depend on that information.
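
To make the dependence concrete, here is a minimal sketch. It is an illustration only, not the code inside lm() or predict.lm(): the data and the names train, fit_levels, recoded, aligned, X_fit and X_new are made up for this example. It shows what happens when new data is re-coded without the levels seen at fit time, and what happens when those levels are supplied.

```r
# Illustrative data (an assumption for this sketch, not the data from the thread).
d <- data.frame(x = 1:10,
                f = rep(c("A", "B", "C"), c(4, 3, 3)),
                stringsAsFactors = FALSE)
train <- d[d$f != "B", ]          # the fit never sees level "B"

# Levels present at fit time; this is the information that has to be stored.
fit_levels <- levels(factor(train$f))        # "A" "C"

# Without stored levels, the new data is re-coded from scratch and gains "B".
recoded <- factor(d$f)                       # levels "A" "B" "C"

# With stored levels, the unseen "B" rows are flagged as NA instead.
aligned <- factor(d$f, levels = fit_levels)

# The design matrices no longer line up: 3 columns at fit time, 4 afterwards.
X_fit <- model.matrix(~ x + f, data = data.frame(x = train$x, f = factor(train$f)))
X_new <- model.matrix(~ x + f, data = data.frame(x = d$x, f = recoded))
colnames(X_fit)   # "(Intercept)" "x" "fC"
colnames(X_new)   # "(Intercept)" "x" "fB" "fC"
```

When the stored levels are available, rows 5 to 7 of aligned are NA, which is the behaviour one would hope predict() could reproduce for the "B" rows.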
On 02/11/2013 11:50 AM, Duncan Murdoch wrote:
> On 11/02/2013 12:13 PM, William Dunlap wrote:
>> Note that changing this does not just mean getting rid of "silly warnings".
>> Currently, predict.lm() can give wrong answers when stringsAsFactors is FALSE.
>> > d <- data.frame(x=1:10, f=rep(c("A","B","C"), c(4,3,3)), y=c(1:4, 15:17,
>> > fit_ab <- lm(y ~ x + f, data = d, subset = f!="B")
>> Warning message:
>> In model.matrix.default(mt, mf, contrasts) :
>> variable 'f' converted to a factor
>> > predict(fit_ab, newdata=d)
>> 1 2 3 4 5 6 7 8 9 10
>> 1 2 3 4 25 26 27 8 9 10
>> Warning messages:
>> 1: In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
>> variable 'f' converted to a factor
>> 2: In predict.lm(fit_ab, newdata = d) :
>> prediction from a rank-deficient fit may be misleading
>> fit_ab is not rank-deficient, and predict() should report
>> 1 2 3 4 NA NA NA 28 29 30
> In R-devel, the two warnings about factor conversions are no longer given, but the
> predictions are the same and the warning about rank deficiency still shows up. If f is
> set to be a factor, an error is generated:
> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
> object$xlevels) :
> factor f has new levels B
> I think both the warning and error are somewhat reasonable responses. The fit is rank
> deficient relative to the model that includes f == "B", because the column of the
> design matrix corresponding to f level B would be completely zero. In this particular
> model, we could still do predictions for the other levels, but it also seems reasonable
> to quit, given that clearly something has gone wrong.
> I do think that it's unfortunate that we don't get the same result in both cases, and
> I'd like to have gotten the predictions you suggested, but I don't think that's going to
> happen. The reason for the difference is that the subsetting is done before the
> conversion to a factor, but I think that is unavoidable without really big changes.
> Duncan Murdoch
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>> > -----Original Message-----
>> > From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf
>> > Of Terry Therneau
>> > Sent: Monday, February 11, 2013 5:50 AM
>> > To: r-devel at r-project.org; Duncan Murdoch
>> > Subject: Re: [Rd] stringsAsFactors
>> > I think your idea to remove the warnings is excellent, and a good compromise. Characters
>> > already work fine in modeling functions except for the silly warning.
>> > It is interesting how often the defaults for a program reflect the data sets in use at the
>> > time the defaults were chosen. There are some such in my own survival package whose
>> > proper value is no longer as "obvious" as it was when I chose them. Factors are very
>> > handy for variables which have only a few levels and will be used in modeling. Every
>> > character variable of every dataset in "Statistical Models in S", which introduced
>> > factors, is of this type so auto-transformation made a lot of sense. The "solder" data
>> > set there is one for which Helmert contrasts are proper so guess what the default contrast
>> > option was? (I think there are only a few data sets in the world for which Helmert makes
>> > sense, however, and R eventually changed the default.)
>> > For character variables that should not be factors, such as a street address,
>> > stringsAsFactors can be a real PITA, and I expect that people's preference for the default
>> > depends almost entirely on how often these arise in their own work. As long as there is
>> > an option that can be overridden I'm okay. Yes, I'd prefer FALSE as the default, partly
>> > because the current value is a tripwire in the hallway that eventually catches every new
>> > user.
>> > Terry Therneau
>> > On 02/11/2013 05:00 AM, r-devel-request at r-project.org wrote:
>> > > Both of these were discussed by R Core. I think it's unlikely the
>> > > default for stringsAsFactors will be changed (some R Core members like
>> > > the current behaviour), but it's fairly likely the show.signif.stars
>> > > default will change. (That's if someone gets around to it: I
>> > > personally don't care about that one. P-values are commonly used
>> > > statistics, and the stars are just a simple graphical display of them.
>> > > I find some p-values to be useful, and the display to be harmless.)
>> > >
>> > > I think it's really unlikely the more extreme changes (i.e. dropping
>> > > show.signif.stars completely, or dropping p-values) will happen.
>> > >
>> > > Regarding stringsAsFactors: I'm not going to defend keeping it as is,
>> > > I'll let the people who like it defend it. What I will likely do is
>> > > make a few changes so that character vectors are automatically changed
>> > > to factors in modelling functions, so that operating with
>> > > stringsAsFactors=FALSE doesn't trigger silly warnings.
>> > ______________________________________________
>> > R-devel at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel