[Rd] problem with zero-weighted observations in predict.lm?

Peter Dalgaard pdalgd at gmail.com
Thu Jul 29 08:10:59 CEST 2010

Peter Dalgaard wrote:
> William Dunlap wrote:
>> In modelling functions some people like to use
>> a weight of 0 to drop an observation instead of
>> using a subset value of FALSE.  E.g.,
>>   weights=c(0,1,1,...)
>> instead of
>>   subset=c(FALSE, TRUE, TRUE, ...)
>> to drop the first observation.
>> lm() and summary.lm() appear to treat these in the
>> same way, decrementing the number of degrees of
>> freedom for each dropped observation.  However,
>> predict.lm() does not treat them the same.  It
>> doesn't seem to diminish the df to account for the
>> 0-weighted observations.   E.g., the last printout
>> from the following script is as follows, where
>> predw is the prediction from the fit that used
>> 0-weights and preds is from using FALSE's in the
>> subset argument.  Is this difference proper?
> Nice catch.
> The issue is that the subset fit and the zero-weighted fit are not
> completely the same. Notice that the residuals vector has different
> length in the two analyses. With a simplified setup:
> > length(lm(y~1,weights=w)$residuals)
> [1] 10
> > length(lm(y~1,subset=-1)$residuals)
> [1] 9
> > w
>  [1] 0 1 1 1 1 1 1 1 1 1
> This in turn is what confuses predict.lm because it gets n and residual
> df from length(object$residuals). summary.lm() uses NROW(Qr$qr), and I
> suppose that predict.lm should follow suit.

...and then when I went to fix it, I found that the actual line in the
sources (stats/R/lm.R) reads

 27442     ripley     n <- length(object$residuals) # NROW(object$qr$qr)

so it's been like that since December 2003. I wonder if Brian remembers
what the point was? (27442 was the restructuring into the stats package,
so it might not actually be Brian's code).
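
For anyone wanting to check the bookkeeping, here is a minimal sketch in R.
It assumes simulated data: the y below is made up for illustration, and w
matches the weight vector quoted above. It compares the row count that
predict.lm reads with the one summary.lm reads, and the t quantile each
count would imply for a 95% interval.

```r
# Hypothetical data for illustration only; not the script from the original report.
set.seed(1)
y <- rnorm(10)
w <- c(0, rep(1, 9))                  # zero weight on the first observation

fit_w <- lm(y ~ 1, weights = w)       # drop observation 1 via a zero weight
fit_s <- lm(y ~ 1, subset = -1)       # drop observation 1 via subset

# The count predict.lm reads: includes the zero-weighted row.
n_resid_w <- length(fit_w$residuals)
n_resid_s <- length(fit_s$residuals)

# The count summary.lm reads: rows that actually entered the QR decomposition.
n_qr_w <- NROW(fit_w$qr$qr)
n_qr_s <- NROW(fit_s$qr$qr)

p <- fit_w$rank                       # number of estimated coefficients

# Residual df implied by each count, and the matching 95% t quantile.
df_from_resid <- n_resid_w - p
df_from_qr    <- n_qr_w - p
q_from_resid  <- qt(0.975, df_from_resid)
q_from_qr     <- qt(0.975, df_from_qr)

print(c(n_resid_w = n_resid_w, n_resid_s = n_resid_s,
        n_qr_w = n_qr_w, n_qr_s = n_qr_s))
print(c(df_from_resid = df_from_resid, df_from_qr = df_from_qr,
        df_stored = df.residual(fit_w)))
print(c(q_from_resid = q_from_resid, q_from_qr = q_from_qr))
```

In this sketch only length(object$residuals) counts the zero-weighted row,
so a df derived from it is one too large and the resulting t quantile is
slightly too small.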


Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
