[R] lm: mark sample used in estimation
Anirban Mukherjee
anirban.mukherjee at gmail.com
Tue Jul 12 17:06:21 CEST 2011
Thanks Peter, Ted!
Best, Anirban
On Tue, Jul 12, 2011 at 4:54 AM, Ted Harding <ted.harding at wlandres.net> wrote:
> On 11-Jul-11 07:55:44, Anirban Mukherjee wrote:
>> Hi all,
>>
>> I want to mark the estimation sample, that is, to mark which rows
>> (observations) are deleted by lm due to missingness. For example,
>> in the original example from the help page, I have changed one of
>> the values in trt to NA (missing).
>>
>># code below
>># ----
>># original example
>>> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
>>> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
>>
>># change 8th observation of trt (18th observation of weight) to NA
>>> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,NA,4.32,4.69)
>>> group <- gl(2,10,20, labels=c("Ctl","Trt"))
>>> weight <- c(ctl, trt)
>>> lm.D9 <- lm(weight ~ group)
>>> summary(lm.D9)
>>
>> Call:
>> lm(formula = weight ~ group)
>>
>> Residuals:
>>      Min       1Q   Median       3Q      Max
>> -1.04556 -0.48378  0.05444  0.23622  1.39444
>>
>> Coefficients:
>>             Estimate Std. Error t value Pr(>|t|)
>> (Intercept)   5.0320     0.2258  22.281 5.09e-14 ***
>> groupTrt     -0.3964     0.3281  -1.208    0.244
>> ---
>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>
>> Residual standard error: 0.7142 on 17 degrees of freedom
>>   (1 observation deleted due to missingness)
>> Multiple R-squared: 0.07907,    Adjusted R-squared: 0.0249
>> F-statistic:  1.46 on 1 and 17 DF,  p-value: 0.2435
>>
>># ------
>># end snippet
>>
>> I want to generate an indicator variable to mark the observations used
>> in estimation: 1 for a row not deleted, 0 for a row deleted. In this
>> case I want an indicator variable that has seventeen 1s, one 0, and
>> then two 1s. I know I can do ind = !is.na(weight) in the above example.
>> But I am ideally looking for a way that allows one to use any formula
>> in lm, and still be able to mark the estimation sample.
>> Is there a function or option I am missing? The best I could come up with:
>>
>>> lm.D9 <- lm(weight ~ group, model=TRUE)
>>> ind <- as.numeric(row.names(lm.D9$model))
>># substitute nrow(<data frame used in estimation>) for length(group)
>>> esamp <- rep(0,length(group))
>>> esamp[ind] <- 1
>>> esamp
>> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
>>
>> Is this "safe" (recommended?)?
>>
>> Appreciate any help.
>>
>> Best, A
>
> Separately from Peter Dalgaard's response, you raise a generic
> question about how to find out which observations have been
> used in an LM fit when some cases may have been omitted, e.g.
> because of missing values (NA).
>
> Take the following as an example:
>
> X <- (1:10)
> Y <- X + rnorm(10)
> LM <- lm(Y ~ X)
>
> X1 <- X
> X1[c(4,8)] <- NA ## so cases 4 & 8 will be omitted
> LM1 <- lm(Y ~ X1)
>
> row.names(LM$model)
> # [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
> row.names(LM1$model)
> # [1] "1" "2" "3" "5" "6" "7" "9" "10"
>
> which( (row.names(LM$model) %in% row.names(LM1$model)) )
> # [1] 1 2 3 5 6 7 9 10
> ### These are the indices of the cases which were kept
>
> which(!(row.names(LM$model) %in% row.names(LM1$model)) )
> # [1] 4 8
> ### These are the indices of the cases which were omitted
>
> You could also use 'names(LM$res)' and 'names(LM1$res)'
> instead of 'row.names(LM$model)' and 'row.names(LM1$model)'
> in the above.
>
> Hoping this helps,
> Ted.
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <ted.harding at wlandres.net>
> Fax-to-email: +44 (0)870 094 0861
> Date: 11-Jul-11 Time: 21:54:05
> ------------------------------ XFMail ------------------------------
>
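For reference, here is a minimal sketch of two other ways to mark the
estimation sample, using the data from the original post. It assumes the
default na.action (na.omit) for the first fit. The helper name mark_esamp
and the object names fit, fit2, esamp and esamp2 are made up for
illustration and are not part of base R. Unlike
as.numeric(row.names(fit$model)), neither approach depends on the row names
being numeric.

```r
# Data from the original post; the 8th value of trt (18th of weight) is NA.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,NA,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)

# Approach 1: ask the fit which rows na.omit dropped.
fit <- lm(weight ~ group)

# mark_esamp is a hypothetical helper: n is the number of rows in the
# data that was passed to lm().
mark_esamp <- function(fit, n) {
  esamp <- rep(1, n)
  dropped <- na.action(fit)      # NULL when no rows were dropped
  if (!is.null(dropped)) esamp[as.integer(dropped)] <- 0
  esamp
}
esamp <- mark_esamp(fit, length(weight))

# Approach 2: refit with na.exclude, so residuals() is padded with NA
# at the positions of the dropped rows.
fit2 <- lm(weight ~ group, na.action = na.exclude)
esamp2 <- as.numeric(!is.na(residuals(fit2)))

esamp
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
```

Both vectors mark row 18 as dropped. The first approach leaves the fitted
object unchanged; the second needs na.exclude to be set when the model is
fitted, but then fitted() and predict() are also padded to the full length.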