[R] lm: mark sample used in estimation
(Ted Harding)
ted.harding at wlandres.net
Mon Jul 11 22:54:11 CEST 2011
On 11-Jul-11 07:55:44, Anirban Mukherjee wrote:
> Hi all,
>
> I wanted to mark the estimation sample: mark what rows (observations)
> are deleted by lm due to missingness. For eg, from the original
> example in help, I have changed one of the values in trt to be NA
> (missing).
>
># code below
># ----
># original example
>> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
>> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
>
># change 18th observation of trt
>> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,NA,4.32,4.69)
>> group <- gl(2,10,20, labels=c("Ctl","Trt"))
>> weight <- c(ctl, trt)
>> lm.D9 <- lm(weight ~ group)
>> summary(lm.D9)
>
> Call:
> lm(formula = weight ~ group)
>
> Residuals:
> ____ Min______ 1Q__ Median______ 3Q_____ Max
> -1.04556 -0.48378_ 0.05444_ 0.23622_ 1.39444
>
> Coefficients:
> ___________ Estimate Std. Error t value Pr(>|t|)
> (Intercept)__ 5.0320____ 0.2258_ 22.281 5.09e-14 ***
> groupTrt____ -0.3964____ 0.3281_ -1.208___ 0.244
> ---
> Signif. codes:_ 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
>
> Residual standard error: 0.7142 on 17 degrees of freedom
> _ (1 observation deleted due to missingness)
> Multiple R-squared: 0.07907,___ Adjusted R-squared: 0.0249
> F-statistic:_ 1.46 on 1 and 17 DF,_ p-value: 0.2435
>
># ------
># end snippet
>
> I want to generate an indicator variable to mark the observations used
> in estimation: 1 for a row not deleted, 0 for a row deleted. In this
> case I want an indicator variable that has seventeen 1s, one 0, and
> then 2 1s. I know I can do ind = !is.na(group) in the above example.
> But I am ideally looking for a way that allows one to use any formula
> in lm, and still be able to mark the estimation sample.
> Function/option I am missing? The best I could come up with:
>
>> lm.D9 <- lm(weight ~ group, model=TRUE)
>> ind <- as.numeric(row.names(lm.D9$model))
>> esamp <- rep(0,length(group)) #substitute nrow(data.frame used in
>> estimation) for length(group)
>> esamp[ind] <- 1
>> esamp
> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
>
> Is this "safe" (recommended?)?
>
> Appreciate any help.
>
> Best, A
Separately from Peter Dalgaard's response, you raise a generic
quedtion about how to find out which observations have been
used in an LM fit when some cases may have been omitted, e.g.
because of missing values (NA).
Take the following as an example:
X <- (1:10)
Y <- X + rnorm(10)
LM <- lm(Y ~ X)
X1 <- X
X1[c(4,8)] <- NA ## so cases 4 & 8 will be omitted
LM1 <- lm(Y ~ X1)
row.names(LM$model)
# [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
row.names(LM1$model)
# [1] "1" "2" "3" "5" "6" "7" "9" "10"
which( (row.names(LM$model) %in% row.names(LM1$model)) )
# [1] 1 2 3 5 6 7 9 10
### These are the indices of the cases which were kept
which(!(row.names(LM$model) %in% row.names(LM1$model)) )
# [1] 4 8
### These are this indices of the cases which were omitted
You could also use 'names(LM$res)' and 'names(LM1$res)'
instead of 'row.names(LM$model' and 'row.names(LM$model)'
in the above.
Hoping this helps,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at wlandres.net>
Fax-to-email: +44 (0)870 094 0861
Date: 11-Jul-11 Time: 21:54:05
------------------------------ XFMail ------------------------------
More information about the R-help
mailing list