[R] missing level of a nested factor results in an NA in lm output

Mon Sep 21 20:19:35 CEST 2009

Dear All,

I have posted to this list before regarding the same issue so I  
apologize for the multiple e-mails.  I am still struggling with this  
issue so I thought I'd give it another try.  This time I have included  
reproducible code and a subset of the data I am analyzing.

I am running an ANOVA with three factors:  GROUP (5 levels), FEATURE  
(2 levels), and PATIENT (2 levels), where PATIENT is nested within  
GROUP.  I am interested in estimating various linear functions of the  
model coefficients (which I sometimes refer to as 'contrasts' below).   
An example of the data can be set up using the following code:

example <- data.frame(
		ABUNDANCE = rnorm(30, 12),
		FEATURE = factor(rep(c(3218, 4227, 6374), 10)),
		GROUP = factor(rep(c(0, 1, 2, 3, 4), 6)),
		PATIENT = factor(rep(c(1, 2), 15))
)

I am using the lm function to run the model as shown.

fit <- lm(ABUNDANCE ~ FEATURE + GROUP + FEATURE:GROUP + GROUP/PATIENT,  
example)
summary(fit)

The output of this code is below.

 > fit <- lm(ABUNDANCE ~ FEATURE + GROUP + FEATURE:GROUP + GROUP/ 
PATIENT, example)
 > summary(fit)

Call:
lm(formula = ABUNDANCE ~ FEATURE + GROUP + FEATURE:GROUP + GROUP/ 
PATIENT,
    data = example)

Residuals:
       Min         1Q     Median         3Q        Max
-5.510e-01 -1.961e-01  3.469e-17  1.961e-01  5.510e-01

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         12.6487     0.4037  31.331 2.58e-11 ***
FEATURE4227         -0.3271     0.4944  -0.661  0.52325
FEATURE6374         -1.0743     0.4944  -2.173  0.05492 .
GROUP1              -1.2641     0.5709  -2.214  0.05120 .
GROUP2              -0.2359     0.5709  -0.413  0.68825
GROUP3              -1.3081     0.5709  -2.291  0.04492 *
GROUP4              -0.5867     0.5709  -1.028  0.32836
FEATURE4227:GROUP1   1.5651     0.6993   2.238  0.04915 *
FEATURE6374:GROUP1   1.1435     0.6993   1.635  0.13302
FEATURE4227:GROUP2   0.4372     0.6993   0.625  0.54577
FEATURE6374:GROUP2   0.6728     0.6993   0.962  0.35867
FEATURE4227:GROUP3   1.0135     0.6993   1.449  0.17786
FEATURE6374:GROUP3   2.2665     0.6993   3.241  0.00885 **
FEATURE4227:GROUP4   1.3278     0.6993   1.899  0.08679 .
FEATURE6374:GROUP4   0.5610     0.6993   0.802  0.44103
GROUP0:PATIENT2     -0.5569     0.4037  -1.379  0.19785
GROUP1:PATIENT2     -0.1104     0.4037  -0.273  0.79014
GROUP2:PATIENT2     -0.9702     0.4037  -2.403  0.03712 *
GROUP3:PATIENT2     -0.1400     0.4037  -0.347  0.73586
GROUP4:PATIENT2     -0.5947     0.4037  -1.473  0.17147
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4944 on 10 degrees of freedom
Multiple R-squared: 0.8004,     Adjusted R-squared: 0.4211
F-statistic:  2.11 on 19 and 10 DF,  p-value: 0.1133

I then use the estimable function to estimate a linear combination of  
the parameter estimates.

myEstimate <- cbind(
			'(Intercept)' = 1,
			'GROUP1' = 1,
			'FEATURE4227:GROUP1' = 0.5,
			'FEATURE6374:GROUP1' = 0.5,
			'GROUP0:PATIENT2' = 1
)
rownames(myEstimate) <- "test"

 > estimable(fit, myEstimate)
     Estimate Std. Error  t value DF     Pr(>|t|)
test 12.18198  0.6694812 18.19615 10 5.395944e-09

I am able to get the t-statistic and associated p-value for the  
contrast as desired.  However, I sometimes have a case where there is  
a missing patient within one of the groups.  To give an example of  
this situation, I remove patient 2 from group 4 and perform the same  
analysis.

example2 <- example[!(example$GROUP == 4 & example$PATIENT == 2),]

 > fit <- lm(ABUNDANCE ~ FEATURE + GROUP + FEATURE:GROUP + GROUP/ 
PATIENT, example2)
 > summary(fit)

Call:
lm(formula = ABUNDANCE ~ FEATURE + GROUP + FEATURE:GROUP + GROUP/ 
PATIENT,
    data = example2)

Residuals:
       Min         1Q     Median         3Q        Max
-5.510e-01 -2.084e-01  6.099e-20  2.084e-01  5.510e-01

Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         12.6487     0.4442  28.475  2.5e-09 ***
FEATURE4227         -0.3271     0.5440  -0.601   0.5644
FEATURE6374         -1.0743     0.5440  -1.975   0.0837 .
GROUP1              -1.2641     0.6282  -2.012   0.0790 .
GROUP2              -0.2359     0.6282  -0.375   0.7171
GROUP3              -1.3081     0.6282  -2.082   0.0709 .
GROUP4              -0.4570     0.7023  -0.651   0.5335
FEATURE4227:GROUP1   1.5651     0.7694   2.034   0.0764 .
FEATURE6374:GROUP1   1.1435     0.7694   1.486   0.1755
FEATURE4227:GROUP2   0.4372     0.7694   0.568   0.5854
FEATURE6374:GROUP2   0.6728     0.7694   0.874   0.4074
FEATURE4227:GROUP3   1.0135     0.7694   1.317   0.2242
FEATURE6374:GROUP3   2.2665     0.7694   2.946   0.0185 *
FEATURE4227:GROUP4   1.2146     0.9423   1.289   0.2334
FEATURE6374:GROUP4   0.2850     0.9423   0.302   0.7700
GROUP0:PATIENT2     -0.5569     0.4442  -1.254   0.2454
GROUP1:PATIENT2     -0.1104     0.4442  -0.248   0.8101
GROUP2:PATIENT2     -0.9702     0.4442  -2.184   0.0605 .
GROUP3:PATIENT2     -0.1400     0.4442  -0.315   0.7606
GROUP4:PATIENT2          NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.544 on 8 degrees of freedom
Multiple R-squared: 0.7852,     Adjusted R-squared: 0.3019
F-statistic: 1.625 on 18 and 8 DF,  p-value: 0.2462

As you can see there are now NAs in the row corresponding to the  
removed patient.  I still want to estimate my contrast so I run the  
same code as before, but I receive the following error messages:

myEstimate <- cbind(
			'(Intercept)' = 1,
			'GROUP1' = 1,
			'FEATURE4227:GROUP1' = 0.5,
			'FEATURE6374:GROUP1' = 0.5,
			'GROUP0:PATIENT2' = 1
)
rownames(myEstimate) <- "test"

 > estimable(fit, myEstimate)
Error in estimable.default(fit, myEstimate) :
Dimension of structure(c(1, 0, 0, 1, 0, 0, 0, 0.5, 0.5, 0, 0, 0, 0, 0,  
0, : 1x20, not compatible with no of parameters in fit: 19
Dimension of 1, 0, 0, 0, 0), .Dim = c(1L, 20L), .Dimnames =  
list("test", c("(Intercept)", : 1x20, not compatible with no of  
parameters in fit: 19
Dimension of "FEATURE4227", "FEATURE6374", "GROUP1", "GROUP2",  
"GROUP3", "GROUP4", : 1x20, not compatible with no of parameters in  
fit: 19
Dimension of "FEATURE4227:GROUP1", "FEATURE6374:GROUP1",  
"FEATURE4227:GROUP2", : 1x20, not compatible with no of parameters in  
fit: 19
Dimension of "FEATURE6374:GROUP2", "FEATURE4227:GROUP3",  
"FEATURE6374:GROUP3", : 1x20, not compatible with no of parameters in  
fit: 19
Dimension of "FEATURE4227:GROUP4", "FEATURE6374:GROUP4",  
"GROUP0:PATIENT2", : 1x20, not compatible with no of parameters in  
fit: 19
Dimension of "GROUP1:PATIENT2", "GROUP2:PATIENT2", "GROUP3:PATIENT2",  
"GROUP4:PATIENT2": 1x20, not compatible with no of parameters in fit: 19
Dimension of ))): 1x2

Apparently the NA in the parameter estimate table is preventing the  
estimation of the contrast.  I have tried multiple alternatives in the  
coding of the estimable statement, but all give me errors similar to  
the one above.

I think the problem is that R expects to see each patient within each  
group and when it doesn't see patient 2 in group 4, it yields an NA  
parameter estimate for this combination.  That's fine - the results of  
the other parameter estimates are still correct.   However it's  
preventing me from estimating a linear function of the model  
coefficients, and I know that there must be a way to do this even in  
the presence of the missing combination.

Has anyone run into this problem before?

Tim

--
Timothy Clough
Ph.D. Student
Department of Statistics
Purdue University
765-496-7263