[Rd] problem with zero-weighted observations in predict.lm?

Tue Jul 27 22:27:43 CEST 2010

William Dunlap wrote:
> In modelling functions some people like to use
> a weight of 0 to drop an observation instead of
> using a subset value of FALSE.  E.g.,
>   weights=c(0,1,1,...)
> instead of
>   subset=c(FALSE, TRUE, TRUE, ...)
> to drop the first observation.
> 
> lm() and summary.lm() appear to treat these in the
> same way, decrementing the number of degrees of
> freedom for each dropped observation.  However,
> predict.lm() does not treat them the same.  It
> doesn't seem to diminish the df to account for the
> 0-weighted observations.   E.g., the last printout
> from the following script is as follows, where
> predw is the prediction from the fit that used
> 0-weights and preds is from using FALSE's in the
> subset argument.  Is this difference proper?

Nice catch.

The issue is that the subset fit and the zero-weighted fit are not
completely the same. Notice that the residuals vector has different
length in the two analyses. With a simplified setup:

> length(lm(y~1,weights=w)$residuals)
[1] 10
> length(lm(y~1,subset=-1)$residuals)
[1] 9
> w
 [1] 0 1 1 1 1 1 1 1 1 1

This in turn is what confuses predict.lm because it gets n and residual
df from length(object$residuals). summary.lm() uses NROW(Qr$qr), and I
suppose that predict.lm should follow suit.

> 
>  predw                        preds                       
>  $fit                         $fit                        
>         fit      lwr      upr        fit      lwr      upr
>  1 1.544302 1.389254 1.699350 1 1.544302 1.194879 1.893724
>  2 1.935504 1.719482 2.151526 2 1.935504 1.448667 2.422341
>                                                           
>  $se.fit                      $se.fit                     
>           1          2                1         2         
>  0.06723657 0.09367810        0.1097969 0.1529757         
>                                                           
>  $df                          $df                         
>  [1] 8                        [1] 3                       
>                                                           
>  $residual.scale              $residual.scale             
>  [1] 0.1031362                [1] 0.1684207                
> 
> ### start of script ###
> data <- data.frame(x=1:10, y=log(1:10))
> fitw <- lm(data=data, y~x, weights=as.numeric(x<=5))
> fits <- lm(data=data, y~x, subset=x<=5)
> fitw$df.residual  == 3 && fits$df.residual == 3 # TRUE
> identical(coef(fitw), coef(fits)) # TRUE
> 
> sumw <- summary(fitw)
> sums <- summary(fits)
> identical(sumw$df, sums$df) # TRUE
> 
> predw <- predict(fitw, interval="confidence",
>          se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5)))
> preds <- predict(fits, interval="confidence",
>          se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5)))
> all.equal(predw, preds) # lots of differences
> 
> sideBySide <- function (a, b, argNames)
> {
>     # print objects a and b side by side
>     oldWidth <- options(width = getOption("width")/2 - 4)
>     on.exit(options(oldWidth))
>     if (missing(argNames)) {
>         argNames <- c(deparse(substitute(a))[1],
>                       deparse(substitute(b))[1])
>     }
>     pa <- capture.output(print(a))
>     pb <- capture.output(print(b))
>     nlines <- max(length(pa), length(pb))
>     length(pa) <- nlines
>     length(pb) <- nlines
>     pb[is.na(pb)] <- ""
>     pa[is.na(pa)] <- ""
>     retval <- cbind(pa, pb, deparse.level = 0)
>     dimnames(retval) <- list(rep("",nrow(retval)), argNames)
>     noquote(retval)
> }
> 
> # lwr, upr, se.fit, df, residual.scale differ
> sideBySide(predw, preds)
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com 
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com