[R] lm: mark sample used in estimation

Anirban Mukherjee anirban.mukherjee at gmail.com
Tue Jul 12 17:06:21 CEST 2011


Thanks Peter, Ted!

Best, Anirban

On Tue, Jul 12, 2011 at 4:54 AM, Ted Harding <ted.harding at wlandres.net> wrote:
> On 11-Jul-11 07:55:44, Anirban Mukherjee wrote:
>> Hi all,
>>
>> I wanted to mark the estimation sample: mark which rows (observations)
>> are deleted by lm due to missingness. For example, from the original
>> example in help, I have changed one of the values in trt to be NA
>> (missing).
>>
>># code below
>># ----
>># original example
>>> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
>>> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
>>
>># set the 8th observation of trt (the 18th of weight) to NA
>>> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,NA,4.32,4.69)
>>> group <- gl(2,10,20, labels=c("Ctl","Trt"))
>>> weight <- c(ctl, trt)
>>> lm.D9 <- lm(weight ~ group)
>>> summary(lm.D9)
>>
>> Call:
>> lm(formula = weight ~ group)
>>
>> Residuals:
>>      Min       1Q   Median       3Q      Max
>> -1.04556 -0.48378  0.05444  0.23622  1.39444
>>
>> Coefficients:
>>             Estimate Std. Error t value Pr(>|t|)
>> (Intercept)   5.0320     0.2258  22.281 5.09e-14 ***
>> groupTrt     -0.3964     0.3281  -1.208    0.244
>> ---
>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>
>> Residual standard error: 0.7142 on 17 degrees of freedom
>>   (1 observation deleted due to missingness)
>> Multiple R-squared: 0.07907,    Adjusted R-squared: 0.0249
>> F-statistic:  1.46 on 1 and 17 DF,  p-value: 0.2435
>>
>># ------
>># end snippet
>>
>> I want to generate an indicator variable to mark the observations used
>> in estimation: 1 for a row not deleted, 0 for a row deleted. In this
>> case I want an indicator variable that has seventeen 1s, one 0, and
>> then two 1s. I know I can do ind = !is.na(weight) in the above example.
>> But I am ideally looking for a way that allows one to use any formula
>> in lm, and still be able to mark the estimation sample.
>> Is there a function or option I am missing? The best I could come up with:
>>
>>> lm.D9 <- lm(weight ~ group, model=TRUE)
>>> ind <- as.numeric(row.names(lm.D9$model))
>>> # substitute nrow(<data frame used in estimation>) for length(group)
>>> esamp <- rep(0,length(group))
>>> esamp[ind] <- 1
>>> esamp
>>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
>>
>> Is this "safe" (recommended?)?
>>
>> Appreciate any help.
>>
>> Best, A
>
> Separately from Peter Dalgaard's response, you raise a generic
> question about how to find out which observations have been
> used in an LM fit when some cases may have been omitted, e.g.
> because of missing values (NA).
>
> Take the following as an example:
>
>  X   <- (1:10)
>  Y   <- X + rnorm(10)
>  LM  <- lm(Y ~ X)
>
>  X1  <- X
>  X1[c(4,8)] <- NA ## so cases 4 & 8 will be omitted
>  LM1 <- lm(Y ~ X1)
>
>  row.names(LM$model)
>  # [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
>  row.names(LM1$model)
>  # [1] "1"  "2"  "3"  "5"  "6"  "7"  "9"  "10"
>
>  which( (row.names(LM$model) %in% row.names(LM1$model)) )
>  # [1]  1  2  3  5  6  7  9 10
>  ### These are the indices of the cases which were kept
>
>  which(!(row.names(LM$model) %in% row.names(LM1$model)) )
>  # [1] 4 8
>  ### These are the indices of the cases which were omitted
>
> You could also use 'names(LM$res)' and 'names(LM1$res)'
> instead of 'row.names(LM$model)' and 'row.names(LM1$model)'
> in the above.
>
> Hoping this helps,
> Ted.
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <ted.harding at wlandres.net>
> Fax-to-email: +44 (0)870 094 0861
> Date: 11-Jul-11                                       Time: 21:54:05
> ------------------------------ XFMail ------------------------------
>
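
A possible way to mark the estimation sample for an arbitrary formula, without
converting row names to numbers, is to read the positions of the dropped rows
from the fit itself. The sketch below is only an illustration using the data
from the original post. mark_esample() is a hypothetical helper, not a base R
function, and it assumes the default na.action (na.omit) and no 'subset'
argument, so that the stored positions refer to rows of the original data.
The second variant refits with na.action = na.exclude and relies on
residuals() padding dropped rows with NA.

```r
# Data from the original post: the 8th value of trt (18th of weight) is NA.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,NA,4.32,4.69)
group  <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)

# Hypothetical helper: 1 = row used in the fit, 0 = row dropped.
# Assumes na.omit handling and no 'subset' argument in the lm() call.
mark_esample <- function(fit, n) {
  esamp <- rep(1L, n)
  dropped <- na.action(fit)            # NULL when no rows were dropped
  if (!is.null(dropped)) esamp[as.integer(dropped)] <- 0L
  esamp
}

fit   <- lm(weight ~ group)
esamp <- mark_esample(fit, length(weight))

# Alternative: with na.exclude, residuals() has one entry per original row,
# and the dropped rows show up as NA.
fit_ex   <- lm(weight ~ group, na.action = na.exclude)
esamp_ex <- as.integer(!is.na(residuals(fit_ex)))
```

Both esamp and esamp_ex should have one entry per element of weight, with a 0
only in position 18. If the model is fitted with a 'subset' argument, the
positions returned by na.action() are relative to the subsetted data, so the
helper as written would need adjusting.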


