[R] lm: mark sample used in estimation

Anirban Mukherjee anirban.mukherjee at gmail.com
Mon Jul 11 09:55:44 CEST 2011


Hi all,

I want to mark the estimation sample, i.e. flag which rows (observations)
are deleted by lm due to missingness. For example, starting from the
original example in the lm help, I have changed one of the values in
trt to NA (missing).

# code below
# ----
# original example
> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)

# set the 8th value of trt (the 18th observation of weight) to NA
> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,NA,4.32,4.69)
> group <- gl(2,10,20, labels=c("Ctl","Trt"))
> weight <- c(ctl, trt)
> lm.D9 <- lm(weight ~ group)
> summary(lm.D9)

Call:
lm(formula = weight ~ group)

Residuals:
     Min       1Q   Median       3Q      Max
-1.04556 -0.48378  0.05444  0.23622  1.39444

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.2258  22.281 5.09e-14 ***
groupTrt     -0.3964     0.3281  -1.208    0.244
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7142 on 17 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.07907,    Adjusted R-squared: 0.0249
F-statistic:  1.46 on 1 and 17 DF,  p-value: 0.2435

# ------
# end snippet

I want to generate an indicator variable to mark the observations used
in estimation: 1 for a row not deleted, 0 for a row deleted. In this
case I want an indicator variable that has seventeen 1s, one 0, and
then two 1s. I know I can do ind = !is.na(weight) in the above example,
but ideally I am looking for a way that works with any formula in lm
and still marks the estimation sample. Is there a function or option I
am missing? The best I could come up with:

> lm.D9 <- lm(weight ~ group, model=TRUE)
> ind <- as.numeric(row.names(lm.D9$model))
> esamp <- rep(0,length(group)) #substitute nrow(data.frame used in estimation) for length(group)
> esamp[ind] <- 1
> esamp
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1

Is this "safe" (and recommended)?
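
For comparison, here is a minimal sketch of two other ways the same indicator might be built. It assumes lm's default na.action (na.omit) for the first fit and passes na.action = na.exclude explicitly for the second. The names fit_omit, fit_excl, dropped, esamp_alt and esamp_check are made up for illustration and are not part of the code above. I have not checked how either variant behaves together with the subset argument of lm.

```r
# Rebuild the example data from this post (8th value of trt set to NA).
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, NA, 4.32, 4.69)
group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)

# Variant 1 (assumes the default na.omit): na.action() on the fit returns
# the positions of the rows that were dropped, or NULL if none were dropped.
fit_omit <- lm(weight ~ group)
dropped <- na.action(fit_omit)
esamp_alt <- rep(1, length(weight))
if (!is.null(dropped)) esamp_alt[as.integer(dropped)] <- 0

# Variant 2: with na.exclude, residuals() is padded with NA for the
# dropped rows, so its length matches the original data.
fit_excl <- lm(weight ~ group, na.action = na.exclude)
esamp_check <- as.numeric(!is.na(residuals(fit_excl)))

print(esamp_alt)
print(identical(esamp_alt, esamp_check))
```

Neither variant converts row names to numbers, so both should also work when the data frame passed to lm has character row names, a case in which as.numeric(row.names(...)) would return NA.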

Appreciate any help.

Best, A


