[Rd] problem with zero-weighted observations in predict.lm?
Peter Dalgaard
pdalgd at gmail.com
Tue Jul 27 22:27:43 CEST 2010
William Dunlap wrote:
> In modelling functions some people like to use
> a weight of 0 to drop an observation instead of
> using a subset value of FALSE. E.g.,
> weights=c(0,1,1,...)
> instead of
> subset=c(FALSE, TRUE, TRUE, ...)
> to drop the first observation.
>
> lm() and summary.lm() appear to treat these in the
> same way, decrementing the number of degrees of
> freedom for each dropped observation. However,
> predict.lm() does not treat them the same. It
> doesn't seem to diminish the df to account for the
> 0-weighted observations. E.g., the last printout
> from the following script is as follows, where
> predw is the prediction from the fit that used
> 0-weights and preds is from using FALSE's in the
> subset argument. Is this difference proper?
Nice catch.
The issue is that the subset fit and the zero-weighted fit are not
completely the same. Notice that the residuals vector has different
length in the two analyses. With a simplified setup:
> length(lm(y~1,weights=w)$residuals)
[1] 10
> length(lm(y~1,subset=-1)$residuals)
[1] 9
> w
[1] 0 1 1 1 1 1 1 1 1 1
This in turn is what confuses predict.lm because it gets n and residual
df from length(object$residuals). summary.lm() uses NROW(Qr$qr), and I
suppose that predict.lm should follow suit.
>
> predw preds
> $fit $fit
> fit lwr upr fit lwr upr
> 1 1.544302 1.389254 1.699350 1 1.544302 1.194879 1.893724
> 2 1.935504 1.719482 2.151526 2 1.935504 1.448667 2.422341
>
> $se.fit $se.fit
> 1 2 1 2
> 0.06723657 0.09367810 0.1097969 0.1529757
>
> $df $df
> [1] 8 [1] 3
>
> $residual.scale $residual.scale
> [1] 0.1031362 [1] 0.1684207
>
> ### start of script ###
> data <- data.frame(x=1:10, y=log(1:10))
> fitw <- lm(data=data, y~x, weights=as.numeric(x<=5))
> fits <- lm(data=data, y~x, subset=x<=5)
> fitw$df.residual == 3 && fits$df.residual == 3 # TRUE
> identical(coef(fitw), coef(fits)) # TRUE
>
> sumw <- summary(fitw)
> sums <- summary(fits)
> identical(sumw$df, sums$df) # TRUE
>
> predw <- predict(fitw, interval="confidence",
> se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5)))
> preds <- predict(fits, interval="confidence",
> se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5)))
> all.equal(predw, preds) # lots of differences
>
> sideBySide <- function (a, b, argNames)
> {
> # print objects a and b side by side
> oldWidth <- options(width = getOption("width")/2 - 4)
> on.exit(options(oldWidth))
> if (missing(argNames)) {
> argNames <- c(deparse(substitute(a))[1],
> deparse(substitute(b))[1])
> }
> pa <- capture.output(print(a))
> pb <- capture.output(print(b))
> nlines <- max(length(pa), length(pb))
> length(pa) <- nlines
> length(pb) <- nlines
> pb[is.na(pb)] <- ""
> pa[is.na(pa)] <- ""
> retval <- cbind(pa, pb, deparse.level = 0)
> dimnames(retval) <- list(rep("",nrow(retval)), argNames)
> noquote(retval)
> }
>
> # lwr, upr, se.fit, df, residual.scale differ
> sideBySide(predw, preds)
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
More information about the R-devel
mailing list