[R] About the interaction A:B
Jeff Laake
Jeff.Laake at noaa.gov
Fri Mar 5 19:32:32 CET 2010
You are correct that you need to use ~-1+A:B. I use that all the time
and just spaced it out when I was writing
the response. Using ~A:B will produce one too many columns.
Didn't follow your other question. You can always look at the result of
model.matrix to see if it is correct or dig into the
internal code. Personally, I've never found it to have a problem.
--jeff
On 3/5/2010 10:16 AM, blue sky wrote:
> On Fri, Mar 5, 2010 at 11:41 AM, Jeff Laake<Jeff.Laake at noaa.gov> wrote:
>
>> On 3/5/2010 9:19 AM, Frank E Harrell Jr wrote:
>>
>>> You neglected to state your name and affiliation, and your question
>>> demonstrates an allergy to R documentation.
>>>
>>> Frank
>>>
>> I agree with Frank but will try to answer some of your questions as I
>> understand it.
>>
>> First, model.matrix uses the options$contrasts or what is set specifically
>> as the contrast for a factor using contrasts(). The default is
>> treatment contrasts and for a factor that means the first level of a factor
>> variable is the intercept and the remainder are "treatment" effects
>> which are added to intercept. If you specify Y~A+B+A:B, you are specifying
>> the model with main effects (A, B) and interactions (A:B).
>>
> So my interpretation of how R does internally when dealing with
> interaction terms A:B is to look if each individual term (each sub
> interactions in highway interactions like, A:B, A:C, etc., in A:B:C:D)
> appears in the formula or not, and deciding how to construct the
> columns of the model matrix according to A:B, right?
>
> This seem quite complicated when a formula is arbitrarily complex, let
> along different type of data (namely ordered, not ordered, numerical
> numbers). The piece of information that I feel that is still missing
> to me is the detailed procedure that R generate model.matrix for
> arbitrary complex formulas and how it is proved the way that R does is
> always correct for these complex formulas.
>
>
>> If A has m levels and B has n levels then there will be an intercept (1),
>> m-1 for A, n-1 for B and (m-1)*(n-1) for the interaction.
>> If you specify the model as Y~A:B it will specify the model fully with
>> interactions which will have m*n separate parameters and none will be
>> NA as long as you have data in each of the m*n cells. It makes absolutely
>> no sense to me to have an intercept and then all but one of the
>> remaining interactions included.
>>
> I think that 'Y ~ A:B' indeed has the intercept term unless '-1' is
> included. You may need to adjust your interpretation of A:B and A:B -1
> in these cases.
>
>> dim(my_subset(model.matrix(Y ~ A:B - 1,fr)))
>>
> [1] 12 12
>
>> dim(my_subset(model.matrix(Y ~ A:B,fr)))
>>
> [1] 12 13
>
>
>
>> Note that you'll get quite different results if A and/or B are not factor
>> variables.
>>
>> --jeff
>>
>>> blue sky wrote:
>>>
>>>> The following is the code for the model.matrix. But it still doesn't
>>>> answer why A:B is interpreted differently in Y~A+B+A:B and Y~A:B. By
>>>> 'why', I mean how R internally does it and what is the rational behind
>>>> the way of doing it?
>>>>
>>>> And it didn't answer why in the model.matrix of Y~A, there are a-1
>>>> terms from A plus the intercept, but in the model.matrix of Y~A:B,
>>>> there are a*b terms (rather than a*b-1 terms) plus the intercept.
>>>> Since the one coefficient of the lm of Y~A:B is going to be NA, why
>>>> bother to include the corresponding term in the model matrix?
>>>>
>>>> ############code below
>>>>
>>>> set.seed(0)
>>>>
>>>> a=3
>>>> b=4
>>>>
>>>> AB_effect=data.frame(
>>>> name=paste(
>>>> unlist(
>>>> do.call(
>>>> rbind
>>>> , rep(list(paste('A', letters[1:a],sep='')), b)
>>>> )
>>>> )
>>>> , unlist(
>>>> do.call(
>>>> cbind
>>>> , rep(list(paste('B', letters[1:b],sep='')), a)
>>>> )
>>>> )
>>>> , sep=':'
>>>> )
>>>> , value=rnorm(a*b)
>>>> , stringsAsFactors=F
>>>> )
>>>>
>>>> max_n=10
>>>> n=sample.int(max_n, a*b, replace=T)
>>>>
>>>> AB=mapply(function(name, n){rep(name,n)}, AB_effect$name, n)
>>>>
>>>> Y=AB_effect$value[match(unlist(AB), AB_effect$name)]
>>>>
>>>> Y=Y+a*b*rnorm(length(Y))
>>>>
>>>> sub_fr=as.data.frame(do.call(rbind, strsplit(unlist(AB), ':')))
>>>> rownames(sub_fr)=NULL
>>>> colnames(sub_fr)=c('A', 'B')
>>>>
>>>> fr=data.frame(Y=Y,sub_fr)
>>>>
>>>> my_subset=function(amm) {
>>>> coding=apply(
>>>> amm
>>>> ,1
>>>> , function(x) {
>>>> paste(x, collapse='')
>>>> }
>>>> )
>>>> amm[match(unique(coding), coding),]
>>>> }
>>>>
>>>> my_subset(model.matrix(Y ~ A*B,fr))
>>>> my_subset(model.matrix(Y ~ (A+B)^2,fr))
>>>> my_subset(model.matrix(Y ~ A + B + A:B,fr))
>>>> my_subset(model.matrix(Y ~ A:B - 1,fr))
>>>> my_subset(model.matrix(Y ~ A:B,fr))
>>>>
>>>> On Fri, Mar 5, 2010 at 8:45 AM, Gabor Grothendieck
>>>> <ggrothendieck at gmail.com> wrote:
>>>>
>>>>> The way to understand this is to look at the output of model.matrix:
>>>>>
>>>>> model.matrix(fo, fr)
>>>>>
>>>>> for each formula you tried. If your data is large you will have to
>>>>> use a subset not to be overwhelmed with output.
>>>>>
>>>>> On Fri, Mar 5, 2010 at 9:08 AM, blue sky<bluesky315 at gmail.com> wrote:
>>>>>
>>>>>> Suppose, 'fr' is data.frame with columns 'Y', 'A' and 'B'. 'A' has
>>>>>> levels 'Aa'
>>>>>> 'Ab' and 'Ac', and 'B' has levels 'Ba', 'Bb', 'Bc' and 'Bd'. 'Y'
>>>>>> columns are numbers.
>>>>>>
>>>>>> I tried the following three sets of commands. I understand that A*B is
>>>>>> equivalent to A+B+A:B. However, A:B in A+B+A:B is different from A:B
>>>>>> just by itself (see the 3rd and 4th set of commands). Would you please
>>>>>> help me understand why the meanings of A:B are different in different
>>>>>> contexts?
>>>>>>
>>>>>> I also see the coefficient of AAc:BBd is NA (the last set of
>>>>>> commands). I'm wondering why this coefficient is not removed from the
>>>>>> 'coefficients' vector. Since lm(Y~A) has coefficients for (intercept),
>>>>>> Ab, Ac (there are no NA's), I think that it is reasonable to make sure
>>>>>> that there are no NA's when there are interaction terms, namely, A:B
>>>>>> in this case.
>>>>>>
>>>>>> Thank you for answering my questions!
>>>>>>
>>>>>>
>>>>>>> alm=lm(Y ~ A*B,fr)
>>>>>>> alm$coefficients
>>>>>>>
>>>>>> (Intercept) AAb AAc BBb BBc BBd
>>>>>> -3.548176 -2.086586 7.003857 4.367800 11.887356 -3.470840
>>>>>> AAb:BBb AAc:BBb AAb:BBc AAc:BBc AAb:BBd AAc:BBd
>>>>>> 5.160865 -11.858425 -12.853116 -20.289611 6.727401 -2.327173
>>>>>>
>>>>>>> alm=lm(Y ~ A + B + A:B,fr)
>>>>>>> alm$coefficients
>>>>>>>
>>>>>> (Intercept) AAb AAc BBb BBc BBd
>>>>>> -3.548176 -2.086586 7.003857 4.367800 11.887356 -3.470840
>>>>>> AAb:BBb AAc:BBb AAb:BBc AAc:BBc AAb:BBd AAc:BBd
>>>>>> 5.160865 -11.858425 -12.853116 -20.289611 6.727401 -2.327173
>>>>>>
>>>>>>> alm=lm(Y ~ A:B - 1,fr)
>>>>>>> alm$coefficients
>>>>>>>
>>>>>> AAa:BBa AAb:BBa AAc:BBa AAa:BBb AAb:BBb AAc:BBb
>>>>>> AAa:BBc
>>>>>> -3.5481765 -5.6347625 3.4556808 0.8196231 3.8939016 -4.0349449
>>>>>> 8.3391795
>>>>>> AAb:BBc AAc:BBc AAa:BBd AAb:BBd AAc:BBd
>>>>>> -6.6005222 -4.9465744 -7.0190168 -2.3782017 -2.3423322
>>>>>>
>>>>>>> alm=lm(Y ~ A:B,fr)
>>>>>>> alm$coefficients
>>>>>>>
>>>>>> (Intercept) AAa:BBa AAb:BBa AAc:BBa AAa:BBb AAb:BBb
>>>>>> -2.34233221 -1.20584424 -3.29243033 5.79801305 3.16195534 6.23623377
>>>>>> AAc:BBb AAa:BBc AAb:BBc AAc:BBc AAa:BBd AAb:BBd
>>>>>> -1.69261273 10.68151168 -4.25819000 -2.60424217 -4.67668454 -0.03586951
>>>>>> AAc:BBd
>>>>>> NA
>>>>>>
>>>>>>
>>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
More information about the R-help
mailing list