[R] Why does the order of terms in a formula translate into different models/ model matrices?
peter dalgaard
pdalgd at gmail.com
Tue May 1 15:02:44 CEST 2012
I know this is three months old, but I had put it on a mental TODO list to see whether it was something that could be fixed, and I finally got a little spare time for that sort of thing. It turns out that it can NOT be fixed (without fundamental design changes), so I thought I'd better record the conclusion for the archives.
The reason is pretty simple: It is terms.formula that decides which terms require contrast coding and which need dummy coding. It does this based on the formula structure _only_; i.e., it has no information on whether variables are factors or numerical. E.g., you can do
> terms(~ a:x + a:b)
~a:x + a:b
attr(,"variables")
list(a, x, b)
attr(,"factors")
a:x a:b
a 2 2
x 2 0
b 0 1
attr(,"term.labels")
[1] "a:x" "a:b"
attr(,"order")
[1] 2 2
attr(,"intercept")
[1] 1
attr(,"response")
[1] 0
attr(,".Environment")
<environment: R_GlobalEnv>
in which there is no indication to terms() what kind of variable "x" is. So if you want different behavior if x is a factor than if it is continuous, you're out of luck...
Not quite sure the last three lines of my note below make sense, though.
-pd
On Jan 29, 2012, at 11:24 , peter dalgaard wrote:
>
> On Jan 29, 2012, at 02:42 , Ben Bolker wrote:
>>>
>>
>> (quoting removed to make Gmane happy)
>>
>> AFAICS, this is a bug.
>>
>>
>> I think so too, although I haven't got my head around it yet.
>>
>> Chuck, are you willing to post a summary of this to r-devel
>> for discussion ... and/or post a bug report?
>
> You have to be very specific about what the bug is, if any... I.e., which precisely are the rules that are broken by the current behavior?
>
> Preferably also suggest a fix --- the corner cases of model.matrix and friends is some of the more impenetrable code in the R sources.
>
> Notice that order dependent parametrization of terms is not a bug per se, nor is the automatic switch to dummy coding of factors. Consider these cases:
>
> d <- cbind(expand.grid(a=c("a","b"),b=c("X","Y"),c=c("U","W")),x=1:8)
> model.matrix(~ a:b + a:c, d)
> model.matrix(~ a:c + a:b, d)
> model.matrix(~ a:b + a:x, d)
> model.matrix(~ a:x + a:b, d)
>
> and notice that the logic applying to "x" is the same as that applying to "c".
>
> The crux seems to be that the model ~a:c contains the model ~a whereas ~a:x does not, and hence the rationale for _not_ expanding a subsequent a:b term to dummies (namely that ~a is "already in" the model) fails.
>
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
>
>
>
>
>
>
>
>
--
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
More information about the R-help
mailing list