[R] two-factor linear models with missing cells

Tue Aug 4 09:54:30 CEST 2009

Murray Jorgensen wrote:
> Hi Peter,
> 
> there is no problem if the missing cell is not in the first row or 
> column: the corresponding interaction parameter is omitted. In my case 
> the data in the (1,4) cell is missing. What results is clear to me now: 
> the (3,4) interaction parameter is dropped so that "(Intercept) + Biv" 
> now refers to the mean of the (3,4) cell rather than the that of the 
> (1,4) cell making the (3,4) cell a sort of 'honorary' member of the 
> first row. This could have been done to the (2,4) cell but I guess the 
> rule is to drop the cell with the highest sum of row and column number.

Ouch. I missed the point completely there...

The generic rule is that singularities (those not expected from the 
model formula) are detected by goin through the model matrix 
left-to-right. In this case, it happens when you get to the (3,4) 
interaction term because the sum of the (2,4) and (3,4) design columns 
is equal to the main effect of (*,4) except in the positions where you 
have no data. So the net effect is that that term is set to zero.

However, it's not true that "(Intercept) + Biv" refers to the (3,4) cell 
mean. You are missing the Athree term (3.755+3.330-1.635 = 5.450). The 
prediction for (1,4) is (3.755 + 0 - 1.635 = 2.120) is not equal to any 
cell mean. Rather, it comes about by completing the 2x2 table consisting 
of the 1st and 3rd row and the 1st and 4th column, assuming no interaction:

   1     4
1 3.755 ?
3 7.085 5.450

3.755 + (5.450 - 7.085) = 2.120

> 
> Murray Jorgensen
> 
> Peter Dalgaard wrote:
>> Murray Jorgensen wrote:
>>> I am wondering how to interpret the parameter estimates that lm()
>>> reports in this sort of situation:
>>>
>>> y = round(rnorm(n=24,mean=5,sd=2),2)
>>> A = gl(3,2,24,labels=c("one","two","three"))
>>> B = gl(4,6,24,labels=c("i","ii","iii","iv"))
>>> # Make both observations for A=1, B=4 missing
>>> y[19] = NA
>>> y[20] = NA
>>> data.frame(y,A,B)
>>> nonadd = lm(y ~ A * B)
>>>
>>>
>>>> summary(nonadd)
>>>
>>> Call:
>>> lm(formula = y ~ A * B)
>>>
>>> Residuals:
>>> Min 1Q Median 3Q Max
>>> -3.555e+00 -7.675e-01 -6.939e-17 7.675e-01 3.555e+00
>>>
>>> Coefficients: (1 not defined because of singularities)
>>> Estimate Std. Error t value Pr(>|t|)
>>> (Intercept) 3.755 1.667 2.252 0.0457 *
>>> Atwo 1.655 2.358 0.702 0.4974
>>> Athree 3.330 2.358 1.412 0.1856
>>> Bii 1.435 2.358 0.609 0.5552
>>> Biii 2.055 2.358 0.871 0.4021
>>> Biv -1.635 2.358 -0.693 0.5025
>>> Atwo:Bii -1.145 3.335 -0.343 0.7378
>>> Athree:Bii -4.535 3.335 -1.360 0.2011
>>> Atwo:Biii -3.230 3.335 -0.969 0.3536
>>> Athree:Biii -2.105 3.335 -0.631 0.5408
>>> Atwo:Biv 1.655 3.335 0.496 0.6295
>>> Athree:Biv NA NA NA NA
>>> ---
>>> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
>>>
>>> Residual standard error: 2.358 on 11 degrees of freedom
>>> (2 observations deleted due to missingness)
>>> Multiple R-squared: 0.2797, Adjusted R-squared: -0.3752
>>> F-statistic: 0.4271 on 10 and 11 DF, p-value: 0.9044
>>>
>>>> fitted(nonadd)
>>> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 21
>>> 3.755 3.755 5.410 5.410 7.085 7.085 5.190 5.190 5.700 5.700 3.985 3.985
>>> 5.810 5.810 4.235 4.235 7.035 7.035 5.430
>>> 22 23 24
>>> 5.430 5.450 5.450
>>>> t(model.matrix(nonadd)%*%coef(nonadd))
>>> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 21 22 23 24
>>> [1,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
>>>
>>> I guess that the parameter estimates reported are linear combinations of
>>> the cell means, but which linear combinations and how does lm() decide
>>> what parameters to report?
>>>
>>> Cheers, Murray
>>>
>>
>> What's the problem? The parameters are defined as usual for the 
>> two-way layout:
>>
>> The intercept is the fitted value in the top left corner
>> The A coefficients are the fitted values in the first column minus the 
>> intercept.
>> The B coefficients vice versa.
>> The interaction coefficients are the fitted values minus the sum of 
>> the the intercept and the corresponding A and B coefficients.
>>
>> One interaction coefficient is set missing because you have no data, 
>> but except for that, the fitted values equal the cell means.
>>
> 

-- 
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907