[Rd] RE: [R] using poly in a linear regression in the presence of
NAf ails (despite subsetting them out)
Prof Brian Ripley
ripley at stats.ox.ac.uk
Sat Feb 19 07:09:10 CET 2005
On Fri, 18 Feb 2005, John Fox wrote:
> Dear Andy, Brian, and Markus,
>
> I've moved this to r-devel because the issue is a bit esoteric. I apologize
> for joining the discussion so late, but didn't have time earlier in the week
> to formulate these ideas.
I think you didn't take time to check what happens in R-devel, though.
> I believe that I understand and appreciate Brian's point, but think that the
> issue isn't entirely clear-cut.
>
> It seems to me that two kinds of questions arise with missing data. The
> deeper ones are statistical but there are also nontrivial mechanical issues
> such as the one here.
>
> Whenever a term in a model is parametrized with a data-dependent basis,
> there's a potential for problems and confusion -- manifested, for example,
> in the issue of "safe" prediction. I don't think that these problems are
> unique to missing data. On the other hand, the basis selected for the
> subspace spanned by a term shouldn't be the most important consideration.
> That is, when models are equivalent -- as, for example lm(y ~ x + I(x^2))
> and lm(y ~ poly(x, 2)), an argument could be made for treating them
> similarly.
Only in linear models, and we are here talking about general behaviour of
finding the model frame.
> Brian's point that NAs, say, in x2, can influence the basis for poly(x1, 2)
> is disquieting, but note that this can happen now if there are no NAs in x1.
That is not what I said, and it is incorrect. poly(x1, 2) is determined
by the set of values of x1 (so they need all to be known hence no NAs in
x1) and nothing else. Example:
> y <- rnorm(10)
> x1 <- 1:10
> x2 <- c(NA, runif(9))
> model.frame(y ~ poly(x1, 2) + x2, na.action = na.omit)
y poly(x1, 2).1 poly(x1, 2).2 x2
2 -0.5110095 -0.38533732 0.17407766 0.07377988
3 -0.9111954 -0.27524094 -0.08703883 0.30968660
4 -0.8371717 -0.16514456 -0.26111648 0.71727174
5 2.4158352 -0.05504819 -0.34815531 0.50454591
6 0.1340882 0.05504819 -0.34815531 0.15299896
7 -0.4906859 0.16514456 -0.26111648 0.50393349
8 -0.4405479 0.27524094 -0.08703883 0.49396092
9 0.4595894 0.38533732 0.17407766 0.75120020
10 -0.6937202 0.49543369 0.52223297 0.17464982
> model.frame(y ~ poly(x1, 2) + x2, na.action = na.pass)
y poly(x1, 2).1 poly(x1, 2).2 x2
1 -0.1102855 -0.49543369 0.52223297 NA
2 -0.5110095 -0.38533732 0.17407766 0.07377988
3 -0.9111954 -0.27524094 -0.08703883 0.30968660
4 -0.8371717 -0.16514456 -0.26111648 0.71727174
5 2.4158352 -0.05504819 -0.34815531 0.50454591
6 0.1340882 0.05504819 -0.34815531 0.15299896
7 -0.4906859 0.16514456 -0.26111648 0.50393349
8 -0.4405479 0.27524094 -0.08703883 0.49396092
9 0.4595894 0.38533732 0.17407766 0.75120020
10 -0.6937202 0.49543369 0.52223297 0.17464982
> The point, therefore, doesn't really justify the current behaviour of
> poly(). Indeed, if there are NAs in x2 but not in x1, the columns
> representing poly(x1, 2) won't be orthogonal in the subset of cases used in
> the model fit (though they do provide a correct basis for the term).
It is currently documented to not allow NAs:
x, newdata: a numeric vector at which to evaluate the polynomial. 'x'
can also be a matrix. Missing values are not allowed in 'x'.
Why does that need any justification?
> Consider another example -- lm(y ~ f, subset = f != "a"), where f is a
> factor with levels "a", "b", and "c", and where there are NAs in f. Here the
> basis for the f term is data dependent, in that the baseline level is taken
> as "b" in the absence of "a", yet NAs don't cause an error.
>
> Having poly(), bs(), and ns() report a more informative error message
> certainly is preferable to the current situation, but an alternative would
> be to have them work and perhaps report a warning.
What exactly does `work' mean here? Run and give misleading answers?
> If the object is to protect people from stepping into traps because of
> missing data, then an argument could be made for having the default
> na.action be na.fail, as in S-PLUS. (I wouldn't advocate this, by the way.)
> Perhaps I'm missing some consideration here, but isn't it clearest to allow
> the data, subset, and na.action arguments to determine the data in which the
> formula is evaluated?
Exactly how? It would be a major change from the current methodology and
I am not sure you appreciate what that is.
Remember that na.action could for example replace missing values by random
imputations, or even multiple imputations, and na.action comes quite late
in the process *after* the model frame has been formed. Unlike factors,
poly() etc generate multiple columns in the model frame, and then subset
and na.action are applied. Factors are encoded in model.matrix (and that
is really only used for the linear parts of models, unlike model.frame).
I am not sure you appreciate the basic point: poly is applied to the whole
dataset and not to a subset, and that can be important (e.g. to ensure no
unstable extrapolation when predicting later). Really na.action has to
come last, as functions in the formula could themselves generate NAs
(log(0) for example).
There is an alternative, that poly() works only on finite values and
passes through non-finite ones, but it was deliberately not written that
way and you will need to convince everyone that would be a better
solution. (What happens for example if there are only two finite values?)
If you want such a function, it is easy for you to provide it: just please
don't call it poly(). [Writing poly() was not easy BTW, but
poly_allow_NAs would be given poly.]
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
More information about the R-devel
mailing list