[Rd] RE: [R] using poly in a linear regression in the presence of NA fails (despite subsetting them out)

John Fox jfox at mcmaster.ca
Sat Feb 19 14:36:44 CET 2005


Dear Brian,

Whenever I disagree with you, I wonder what error I made, and almost always
discover that there was something I missed.

> -----Original Message-----
> From: r-devel-bounces at stat.math.ethz.ch 
> [mailto:r-devel-bounces at stat.math.ethz.ch] On Behalf Of Prof 
> Brian Ripley
> Sent: Saturday, February 19, 2005 1:09 AM
> To: John Fox
> Cc: 'Markus Jantti'; r-devel at stat.math.ethz.ch; 'Liaw,Andy'
> Subject: [Rd] RE: [R] using poly in a linear regression in 
> the presence of NA fails (despite subsetting them out)
> 
> On Fri, 18 Feb 2005, John Fox wrote:
> 
> > Dear Andy, Brian, and Markus,
> >
> > I've moved this to r-devel because the issue is a bit esoteric. I 
> > apologize for joining the discussion so late, but didn't have time 
> > earlier in the week to formulate these ideas.
> 
> I think you didn't take time to check what happens in R-devel, though.

I did check the R 2.1 news file for poly, na.action, and na.omit, but didn't
find anything. Did I miss something?

> 
> > I believe that I understand and appreciate Brian's point, but think 
> > that the issue isn't entirely clear-cut.
> >
> > It seems to me that two kinds of questions arise with missing data. 
> > The deeper ones are statistical but there are also nontrivial 
> > mechanical issues such as the one here.
> >
> > Whenever a term in a model is parametrized with a data-dependent 
> > basis, there's a potential for problems and confusion -- 
> manifested, 
> > for example, in the issue of "safe" prediction. I don't think that 
> > these problems are unique to missing data. On the other hand, the 
> > basis selected for the subspace spanned by a term shouldn't 
> be the most important consideration.
> > That is, when models are equivalent -- as, for example lm(y ~ x + 
> > I(x^2)) and lm(y ~ poly(x, 2)), an argument could be made 
> for treating 
> > them similarly.
> 
> Only in linear models, and we are here talking about general 
> behaviour of finding the model frame.
> 

That's a good point -- I hadn't considered it. I don't see the full
ramifications, however.

> > Brian's point that NAs, say, in x2, can influence the basis for 
> > poly(x1, 2) is disquieting, but note that this can happen 
> now if there are no NAs in x1.
> 
> That is not what I said, and it is incorrect.  poly(x1, 2) is 
> determined by the set of values of x1 (so they need all to be 
> known hence no NAs in
> x1) and nothing else.  Example:

What you said was, "The problem is that poly(x, 2) depends on the possible
set of x values, and so needs to know all of them, unlike e.g. log(x) which
is observation-by-observation.  Silently omitting missing values is not a
good idea in such cases, especially if the values are missing in other
variables (which is what na.action is likely to do)." I misinterpreted it.
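A quick illustration of that set-dependence (a sketch, evaluating poly() directly rather than through a model formula):

```r
## poly()'s basis at a given x value depends on the whole set of x's,
## whereas log() is computed observation by observation.
poly(1:10, 2)[2, ]   # basis row for x = 2, computed from all of 1:10
poly(2:10, 2)[1, ]   # basis row for x = 2, computed from 2:10 -- differs
log(2)               # the same no matter which other values are present
```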

> 
> > y <- rnorm(10)
> > x1 <- 1:10
> > x2 <- c(NA, runif(9))
> > model.frame(y ~ poly(x1, 2) + x2, na.action = na.omit)
>              y poly(x1, 2).1 poly(x1, 2).2         x2
> 2  -0.5110095   -0.38533732    0.17407766 0.07377988
> 3  -0.9111954   -0.27524094   -0.08703883 0.30968660
> 4  -0.8371717   -0.16514456   -0.26111648 0.71727174
> 5   2.4158352   -0.05504819   -0.34815531 0.50454591
> 6   0.1340882    0.05504819   -0.34815531 0.15299896
> 7  -0.4906859    0.16514456   -0.26111648 0.50393349
> 8  -0.4405479    0.27524094   -0.08703883 0.49396092
> 9   0.4595894    0.38533732    0.17407766 0.75120020
> 10 -0.6937202    0.49543369    0.52223297 0.17464982
> > model.frame(y ~ poly(x1, 2) + x2, na.action = na.pass)
>              y poly(x1, 2).1 poly(x1, 2).2         x2
> 1  -0.1102855   -0.49543369    0.52223297         NA
> 2  -0.5110095   -0.38533732    0.17407766 0.07377988
> 3  -0.9111954   -0.27524094   -0.08703883 0.30968660
> 4  -0.8371717   -0.16514456   -0.26111648 0.71727174
> 5   2.4158352   -0.05504819   -0.34815531 0.50454591
> 6   0.1340882    0.05504819   -0.34815531 0.15299896
> 7  -0.4906859    0.16514456   -0.26111648 0.50393349
> 8  -0.4405479    0.27524094   -0.08703883 0.49396092
> 9   0.4595894    0.38533732    0.17407766 0.75120020
> 10 -0.6937202    0.49543369    0.52223297 0.17464982
> 
> 

The point that I was trying to make is this:

> y <- rnorm(10)
> x1 <- 1:10
> x2 <- c(NA, runif(9))
> mf <- model.frame(y ~ poly(x1, 2) + x2, na.action = na.omit)
> cor(mf[,2])
          1         2
1 1.0000000 0.4037864
2 0.4037864 1.0000000
> 
> Data <- na.omit(data.frame(y, x1, x2))
> mf <- model.frame(y ~ poly(x1, 2) + x2, na.action = na.omit, data=Data)
> cor(mf[,2])
  1 2
1 1 0
2 0 1
> 


> > The point, therefore, doesn't really justify the current 
> behaviour of 
> > poly(). Indeed, if there are NAs in x2 but not in x1, the columns 
> > representing poly(x1, 2) won't be orthogonal in the subset of cases 
> > used in the model fit (though they do provide a correct 
> basis for the term).
> 
> It is currently documented to not allow NAs:
> 
> x, newdata: a numeric vector at which to evaluate the polynomial. 'x'
>            can also be a matrix.  Missing values are not 
> allowed in 'x'.
> 
> Why does that need any justification?
> 

Because one might expect (perhaps, you could argue, one shouldn't expect)
the basis for the polynomial term to be orthogonal in the subset of
observations actually used for the fit. 

> > Consider another example -- lm(y ~ f, subset = f != "a"), 
> where f is a 
> > factor with levels "a", "b", and "c", and where there are NAs in f. 
> > Here the basis for the f term is data dependent, in that 
> the baseline 
> > level is taken as "b" in the absence of "a", yet NAs don't 
> cause an error.
> >
> > Having poly(), bs(), and ns() report a more informative 
> error message 
> > certainly is preferable to the current situation, but an 
> alternative 
> > would be to have them work and perhaps report a warning.
> 
> What exactly does `work' mean here?  Run and give misleading answers?
> 

That implies that there's a simple right answer here, which I don't think
is the case. I'm persuaded that what I recommended is not a good idea
because of its more general implications, but it's not obvious to me that
creating a basis that's orthogonal in the subset of observations used for
the fit is misleading, while the current behaviour isn't misleading. 
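For what it's worth, the factor case I mentioned can be seen directly (a small sketch; lm() drops the unused level, so the baseline shifts without the NAs causing an error):

```r
## f has levels "a", "b", "c" plus an NA; subsetting out "a" makes
## "b" the baseline level, and the NA row is silently dropped.
set.seed(1)
f <- factor(c("a", "b", "c", NA, "b", "c", "a", "b"))
y <- rnorm(8)
coef(lm(y ~ f, subset = f != "a"))
## the only slope coefficient is for level "c": "b" is the reference
```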

> > If the object is to protect people from stepping into traps 
> because of 
> > missing data, then an argument could be made for having the default 
> > na.action be na.fail, as in S-PLUS. (I wouldn't advocate 
> this, by the 
> > way.) Perhaps I'm missing some consideration here, but isn't it 
> > clearest to allow the data, subset, and na.action arguments to 
> > determine the data in which the formula is evaluated?
> 
> Exactly how?  It would be a major change from the current 
> methodology and I am not sure you appreciate what that is.
> 

Yes, I looked at how this works currently, but agree that I didn't
appreciate the implications of changing it -- e.g., for nonlinear models. I
still don't entirely appreciate those implications.

> Remember that na.action could for example replace missing 
> values by random imputations, or even multiple imputations, 
> and na.action comes quite late in the process *after* the 
> model frame has been formed.  Unlike factors,
> poly() etc generate multiple columns in the model frame, and 
> then subset and na.action are applied.  Factors are encoded 
> in model.matrix (and that is really only used for the linear 
> parts of models, unlike model.frame).
> 
> I am not sure you appreciate the basic point: poly is applied 
> to the whole dataset and not to a subset, and that can be 
> important (e.g. to ensure no unstable extrapolation when 
> predicting later).

Note, however, that a user could create the subset as a data frame through
na.omit() prior to applying poly() -- as I did in the example above. In
fact, that's probably what a user of poly() would now want to do.

>  Really na.action has to come last, as 
> functions in the formula could themselves generate NAs
> (log(0) for example).
> 

I did understand the sequence of operations, but didn't see the point about
newly generated NAs. I agree that this makes it necessary to remove NAs
last.
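For instance (a small sketch; here it is the log of a negative value that produces the NaN inside the formula):

```r
## na.action must run on the completed model frame, because the
## formula itself can introduce missing values:
x <- c(-1, 1, 2, 3)          # log(-1) is NaN (with a warning)
y <- rnorm(4)
nrow(model.frame(y ~ log(x), na.action = na.omit))   # the NaN row is dropped
```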

> There is an alternative, that poly() works only on finite 
> values and passes through non-finite ones, but it was 
> deliberately not written that way and you will need to 
> convince everyone that would be a better solution.  (What 
> happens for example if there are only two finite values?) If 
> you want such a function, it is easy for you to provide it: 
> just please don't call it poly().  [Writing poly() was not 
> easy BTW, but poly_allow_NAs would be given poly.]
> 

I'm no longer convinced that what I proposed is a better solution, but the
current situation seems problematic to me as well.
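For anyone who does want the alternative behaviour Brian describes, a minimal sketch is below (the implementation details are illustrative only, not a proposal for base R):

```r
## Evaluate poly() on the finite values only and pass NAs through.
poly_allow_NAs <- function(x, degree = 1, ...) {
  ok <- is.finite(x)
  if (sum(ok) <= degree)
    stop("too few finite values to form a basis of degree ", degree)
  out <- matrix(NA_real_, nrow = length(x), ncol = degree)
  out[ok, ] <- poly(x[ok], degree = degree, ...)
  colnames(out) <- seq_len(degree)
  out
}
```

With, say, only two finite values and degree = 2 this stops with an error, which addresses the case Brian raises.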

Thanks for the extended explanation.

John

> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> 
> ______________________________________________
> R-devel at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


