[R] lm: mark sample used in estimation

(Ted Harding) ted.harding at wlandres.net
Mon Jul 11 22:54:11 CEST 2011


On 11-Jul-11 07:55:44, Anirban Mukherjee wrote:
> Hi all,
> 
> I wanted to mark the estimation sample: mark what rows (observations)
> are deleted by lm due to missingness. For eg, from the original
> example in help, I have changed one of the values in trt to be NA
> (missing).
> 
># code below
># ----
># original example
>> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
>> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
> 
># change 18th observation of trt
>> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,NA,4.32,4.69)
>> group <- gl(2,10,20, labels=c("Ctl","Trt"))
>> weight <- c(ctl, trt)
>> lm.D9 <- lm(weight ~ group)
>> summary(lm.D9)
> 
> Call:
> lm(formula = weight ~ group)
> 
> Residuals:
>      Min       1Q   Median       3Q      Max
> -1.04556 -0.48378  0.05444  0.23622  1.39444
> 
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)   5.0320     0.2258  22.281 5.09e-14 ***
> groupTrt     -0.3964     0.3281  -1.208    0.244
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
> Residual standard error: 0.7142 on 17 degrees of freedom
>   (1 observation deleted due to missingness)
> Multiple R-squared: 0.07907,    Adjusted R-squared: 0.0249
> F-statistic:  1.46 on 1 and 17 DF,  p-value: 0.2435
> 
># ------
># end snippet
> 
> I want to generate an indicator variable to mark the observations used
> in estimation: 1 for a row not deleted, 0 for a row deleted. In this
> case I want an indicator variable that has seventeen 1s, one 0, and
> then 2 1s. I know I can do ind = !is.na(group) in the above example.
> But I am ideally looking for a way that allows one to use any formula
> in lm, and still be able to mark the estimation sample.
> Function/option I am missing? The best I could come up with:
> 
>> lm.D9 <- lm(weight ~ group, model=TRUE)
>> ind <- as.numeric(row.names(lm.D9$model))
>> esamp <- rep(0,length(group)) # substitute nrow(data.frame used in
>>                               # estimation) for length(group)
>> esamp[ind] <- 1
>> esamp
>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
> 
> Is this "safe" (recommended?)?
> 
> Appreciate any help.
> 
> Best, A

Separately from Peter Dalgaard's response, you raise a generic
question about how to find out which observations have been
used in an LM fit when some cases may have been omitted, e.g.
because of missing values (NA).

Take the following as an example:

  X   <- (1:10)
  Y   <- X + rnorm(10)
  LM  <- lm(Y ~ X)

  X1  <- X
  X1[c(4,8)] <- NA ## so cases 4 & 8 will be omitted
  LM1 <- lm(Y ~ X1)

  row.names(LM$model)
  # [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
  row.names(LM1$model)
  # [1] "1"  "2"  "3"  "5"  "6"  "7"  "9"  "10"

  which( (row.names(LM$model) %in% row.names(LM1$model)) )
  # [1]  1  2  3  5  6  7  9 10
  ### These are the indices of the cases which were kept

  which(!(row.names(LM$model) %in% row.names(LM1$model)) )
  # [1] 4 8
  ### These are the indices of the cases which were omitted
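
If no complete fit such as LM is to hand for comparison, the fitted
object itself can be asked which rows it dropped. The following is
only a sketch, assuming the default na.action (na.omit); the names
'omitted' and 'esamp' are illustrative, and set.seed() is there just
to make the example repeatable.

```r
set.seed(1)
X  <- 1:10
Y  <- X + rnorm(10)
X1 <- X
X1[c(4, 8)] <- NA                      # cases 4 & 8 will be omitted
LM1 <- lm(Y ~ X1)

## Positions of the rows dropped by na.omit (NULL if nothing was dropped)
omitted <- na.action(LM1)

## Indicator: 1 = case used in the fit, 0 = case omitted
esamp <- rep(1L, length(Y))
esamp[as.integer(omitted)] <- 0L
esamp
```

Since this starts from the length of the original data rather than
from a second model, it should carry over to other formulas, as long
as the length (or number of rows) of the original data is known.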

You could also use 'names(LM$res)' and 'names(LM1$res)'
instead of 'row.names(LM$model)' and 'row.names(LM1$model)'
in the above.
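
A related possibility, again only as a sketch, is to fit with
na.action = na.exclude. In that case residuals() pads its result
with NA at the omitted cases, so the indicator can be read straight
off the residuals. The names 'LM2', 'res' and 'used' are illustrative
and not part of the original example.

```r
set.seed(1)
X  <- 1:10
Y  <- X + rnorm(10)
X1 <- X
X1[c(4, 8)] <- NA                      # cases 4 & 8 will be omitted

## Same model, but omitted cases are remembered and padded with NA
LM2 <- lm(Y ~ X1, na.action = na.exclude)

res  <- residuals(LM2)                 # one entry per original case
used <- as.integer(!is.na(res))        # 1 = used in the fit, 0 = omitted
used
```

Note that the padding is done by the extractor residuals(); the
component LM2$residuals itself still holds only the cases used.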

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at wlandres.net>
Fax-to-email: +44 (0)870 094 0861
Date: 11-Jul-11                                       Time: 21:54:05
------------------------------ XFMail ------------------------------


