[Rd] problem with zero-weighted observations in predict.lm?

William Dunlap wdunlap at tibco.com
Tue Jul 27 20:48:56 CEST 2010


In modelling functions some people like to use
a weight of 0 to drop an observation instead of
using a subset value of FALSE.  E.g.,
  weights=c(0,1,1,...)
instead of
  subset=c(FALSE, TRUE, TRUE, ...)
to drop the first observation.

lm() and summary.lm() appear to treat these in the
same way, decrementing the number of degrees of
freedom for each dropped observation.  However,
predict.lm() does not treat them the same.  It
doesn't seem to diminish the df to account for the
0-weighted observations.   E.g., the last printout
from the following script is as follows, where
predw is the prediction from the fit that used
0-weights and preds is from using FALSE's in the
subset argument.  Is this difference proper?

 predw                        preds                       
 $fit                         $fit                        
        fit      lwr      upr        fit      lwr      upr
 1 1.544302 1.389254 1.699350 1 1.544302 1.194879 1.893724
 2 1.935504 1.719482 2.151526 2 1.935504 1.448667 2.422341
                                                          
 $se.fit                      $se.fit                     
          1          2                1         2         
 0.06723657 0.09367810        0.1097969 0.1529757         
                                                          
 $df                          $df                         
 [1] 8                        [1] 3                       
                                                          
 $residual.scale              $residual.scale             
 [1] 0.1031362                [1] 0.1684207                

### start of script ###
data <- data.frame(x=1:10, y=log(1:10))
fitw <- lm(data=data, y~x, weights=as.numeric(x<=5))
fits <- lm(data=data, y~x, subset=x<=5)
fitw$df.residual  == 3 && fits$df.residual == 3 # TRUE
identical(coef(fitw), coef(fits)) # TRUE

sumw <- summary(fitw)
sums <- summary(fits)
identical(sumw$df, sums$df) # TRUE

predw <- predict(fitw, interval="confidence",
         se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5)))
preds <- predict(fits, interval="confidence",
         se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5)))
all.equal(predw, preds) # lots of differences

sideBySide <- function (a, b, argNames)
{
    # print objects a and b side by side
    oldWidth <- options(width = getOption("width")/2 - 4)
    on.exit(options(oldWidth))
    if (missing(argNames)) {
        argNames <- c(deparse(substitute(a))[1],
                      deparse(substitute(b))[1])
    }
    pa <- capture.output(print(a))
    pb <- capture.output(print(b))
    nlines <- max(length(pa), length(pb))
    length(pa) <- nlines
    length(pb) <- nlines
    pb[is.na(pb)] <- ""
    pa[is.na(pa)] <- ""
    retval <- cbind(pa, pb, deparse.level = 0)
    dimnames(retval) <- list(rep("",nrow(retval)), argNames)
    noquote(retval)
}

# lwr, upr, se.fit, df, residual.scale differ
sideBySide(predw, preds)

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 



More information about the R-devel mailing list