[R] Simple question about formulae in R!?

Joshua Wiley jwiley.psych at gmail.com
Fri Aug 10 17:09:43 CEST 2012


On Fri, Aug 10, 2012 at 7:39 AM, Henrik Singmann
<henrik.singmann at psychologie.uni-freiburg.de> wrote:
> Dear Michael and list,
>
> R in general tries hard to prohibit this behavior (i.e., including an
> interaction but not the main effect). When removing a main effect and
> leaving the interaction, the number of parameters is not reduced by one (as
> would be expected) but stays the same, at least when using model.matrix:

This is because for factor or character variables, R automatically
dummy codes them (or uses whatever contrasts were set; dummies are the
default, but others could be set globally, which would then propagate
to the specific model.matrix() call).
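
As a small sketch of this (x here is just a made-up two-level factor):

x <- factor(rep(c("a1", "a2"), each = 3))
contrasts(x)                  # default: treatment (dummy) coding
model.matrix(~ x)             # an intercept plus one dummy column
contrasts(x) <- contr.sum(2)  # switch this one factor to sum coding
model.matrix(~ x)             # same dimensions, different coding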

>
> d <- data.frame(A = rep(c("a1", "a2"), each = 50), B = c("b1", "b2"),
>                 value = rnorm(10))
> ncol(model.matrix(~ A*B, data = d))
> #  [1] 4
> ncol(model.matrix(~ A*B - A, data = d))
> #  [1] 4
>
> I actually don't understand how R parametrizes the model in the second
> case, but I am pretty sure someone here does, and will be able to explain.
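
To answer that part directly: the column names of the model matrix show
how R parametrizes the second case.  A sketch of the output (assuming
the default treatment contrasts):

colnames(model.matrix(~ A*B - A, data = d))
# "(Intercept)" "Bb2"  "Aa2:Bb1"  "Aa2:Bb2"

Because the main effect of A is gone, B is coded with a full set of
indicator columns inside the interaction, so the parameter count stays
at 4 rather than dropping to 3.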

One way to think about this is the regression versus ANOVA style of
parameterization.  Consider a few simple models.

First we have a traditional regression.  am is a two-level factor, so
it gets one dummy code alongside the intercept.  The model matrix looks
something like:

i a
1 0
1 0
1 0
1 1
1 1
1 1
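
A minimal sketch to reproduce this matrix yourself (f is a made-up
two-level factor standing in for am):

f <- factor(rep(c("0", "1"), each = 3))
model.matrix(~ 1 + f)   # an intercept column plus one dummy for f = "1"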

what does that mean for the expected values?  For a = 0,

yhat = b0 * 1 + b1 * 0 = b0 * 1, the intercept

for a = 1

yhat = b0 * 1 + b1 * 1

what is the difference between these two?  Merely b1.  Hence under
this parameterization, b0 is the expected value for the group a = 0,
and b1 is the difference in expected values between the two groups a =
0 and a = 1.

> coef(lm(mpg ~ 1 + factor(am), data = mtcars))
(Intercept) factor(am)1
  17.147368    7.244939

We can find the expected values for each group.  The first is just b0,
the intercept term.  The second is the intercept plus b1, the only
other coefficient.  We can get it just by summing the coefficients,
like so:

> sum(coef(lm(mpg ~ 1 + factor(am), data = mtcars)))
[1] 24.39231

Now, if we explicitly force out the intercept, R uses a different
coding scheme.  The intercept normally captures some reference or
baseline group, which every other group is compared to, but if we
force it out, the default design matrix is something like:

a0 a1
 1  0
 1  0
 1  0
 0  1
 0  1
 0  1
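
Again as a sketch, with the same made-up factor f as before:

model.matrix(~ 0 + f)   # one indicator column per level, no intercept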

This matrix looks very similar to what we had before, only now the two
columns are complementary indicators.  Let's think again about what our
expected values are.  For a = 0

yhat = b0 * 1 + b1 * 0 = b0 * 1

for a = 1

yhat = b0 * 0 + b1 * 1 = b1 * 1

now b0 encodes the expected value of the group a = 0 and b1 encodes
the expected value of the group a = 1.  By default these coefficients
are tested against 0.  There is no longer an explicit test of whether
the two groups differ from each other, because instead of encoding
differences, we have encoded the overall group expectations.  In lm():

> coef(lm(mpg ~ 0 + factor(am), data = mtcars))
factor(am)0 factor(am)1
   17.14737    24.39231

both of those numbers should be familiar from before.  In this case,
both models include two parameters; the only difference is whether they
encode two group expectations, or one expectation and an expected
difference.  These are equivalent models, and their likelihoods etc.
will be identical.  Note, however, that if you suppress the intercept,
R uses a different formula for the R squared, so those will not look
identical (although if you got the predicted values, yhat, from both
models, you would find them identical).
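
You can check this equivalence directly (a sketch; the R squared
values in the comments are approximate):

m1 <- lm(mpg ~ 1 + factor(am), data = mtcars)
m2 <- lm(mpg ~ 0 + factor(am), data = mtcars)
all.equal(fitted(m1), fitted(m2))   # TRUE: identical predictions
logLik(m1)                          # identical log-likelihoods
logLik(m2)
summary(m1)$r.squared               # about 0.36
summary(m2)$r.squared               # much larger, about 0.95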

This concept generalizes to the interaction of categorical variables
without their main effects.  Normally, the interaction terms are
expected differences from the reference group, but if the "main
effect" is dropped, then instead, the interactions become the expected
values of the various cells (if you think of a 2 x 2 table of group
membership).

For example, here it is as usual:

> coef(lm(mpg ~ 1 + factor(vs) * factor(am), data = mtcars))
            (Intercept)             factor(vs)1             factor(am)1
              15.050000                5.692857                4.700000
factor(vs)1:factor(am)1
               2.928571

> 15.050000 + 5.692857
[1] 20.74286
> 15.050000 + 4.7
[1] 19.75
> 15.050000 + 5.692857 + 4.7 + 2.928571
[1] 28.37143

the intercept is the expected value for the group vs = 0 & am = 0;
then you get the difference between the expectations for am = 0 & vs = 0
and am = 0 & vs = 1; then likewise for am; and finally, the additional
bit of being both am = 1 and vs = 1.  I show the by-hand calculations
to get the expected group values.  Right now, we get significance
tests of whether these differences are statistically significantly
different from zero.

If we drop the main effects and intercept, we get:

> coef(lm(mpg ~ 0 + factor(vs) : factor(am), data = mtcars))
factor(vs)0:factor(am)0 factor(vs)1:factor(am)0 factor(vs)0:factor(am)1
               15.05000                20.74286                19.75000
factor(vs)1:factor(am)1
               28.37143

which you can see directly gets all the group expectations.
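
As a quick cross-check (a sketch), the raw cell means reproduce those
four coefficients:

with(mtcars, tapply(mpg, list(vs = vs, am = am), mean))
# a 2 x 2 table of cell means matching the coefficients above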

Cheers,

Josh

> I have asked a question on how to get around this "limitation" on
> stackoverflow with helpful answers by Ben Bolker and Joshua Wiley:
> http://stackoverflow.com/q/11335923/289572
> (this functionality is now used in function mixed() in my new package afex
> for obtaining "type 3" p-values for mixed models)

With few exceptions, I'm in the habit of not giving people type 3
p-values.  People like them, but generally do not actually understand
them.  I would also like to point out that I am in psychology, so yes,
I know psychology likes them.  Further, I am a doctoral student, so it
is not as though I am so established that people bow to my every whim;
it is sometimes work to explain why I am not providing them, and I have
had discussions about this in various lab meetings and with advisors,
but my point is that it is possible.  I would urge you to do the same,
and take heart that there are others working for change; you are not
the only one, even if it feels like it at your university.

>
> Cheers,
> Henrik
>
> On 10.08.2012 15:48, R. Michael Weylandt wrote:
>
>> On Fri, Aug 10, 2012 at 7:36 AM, ONKELINX, Thierry
>> <Thierry.ONKELINX at inbo.be> wrote:
>>>
>>> Dear Johan,
>>>
>>> Why should it be complicated? You have a very simple model, thus a very
>>> simple formula. Isn't that great?
>>>
>>> Your formula matches the model. Though Trust~Culture + Structure *
>>> Speed_of_Integration is another option. The model fit is the same, the only
>>> difference is the parameterization of the model.
>>
>>
>> Not quite, I don't think: A*B is shorthand for A + B + A:B, where A:B
>> is the second-order (interaction) term.  The OP only had the
>> second-order term, without the main effects.
>>
>> OP: You almost certainly want A*B -- it's strange (though certainly
>> not impossible) to have good models where you only have interactions
>> but not main effects.
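
(A quick way to see that expansion, as a sketch:

attr(terms(~ A * B), "term.labels")
# "A"   "B"   "A:B"
attr(terms(~ A : B), "term.labels")
# "A:B"

the first formula carries the main effects, the second does not.)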
>>
>> Cheers,
>> Michael
>>
>>
>>>
>>> Best regards,
>>>
>>> Thierry
>>>
>>> ir. Thierry Onkelinx
>>> Instituut voor natuur- en bosonderzoek / Research Institute for Nature
>>> and Forest
>>> team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
>>> Kliniekstraat 25
>>> 1070 Anderlecht
>>> Belgium
>>> + 32 2 525 02 51
>>> + 32 54 43 61 85
>>> Thierry.Onkelinx at inbo.be
>>> www.inbo.be
>>>
>>> To call in the statistician after the experiment is done may be no more
>>> than asking him to perform a post-mortem examination: he may be able to say
>>> what the experiment died of.
>>> ~ Sir Ronald Aylmer Fisher
>>>
>>> The plural of anecdote is not data.
>>> ~ Roger Brinner
>>>
>>> The combination of some data and an aching desire for an answer does not
>>> ensure that a reasonable answer can be extracted from a given body of data.
>>> ~ John Tukey
>>>
>>>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>>> On Behalf Of Johan Haasnoot
>>> Sent: Friday, 10 August 2012 9:15
>>> To: r-help at r-project.org
>>> Subject: [R] Simple question about formulae in R!?
>>>
>>> Good morning reader,
>>>
>>> I have encountered a probably simple issue with respect to the
>>> *formula* of a *regression model* I want to use in my research.  I'm
>>> researching alliances as part of my study of Business Economics (focus:
>>> Strategy) at the Vrije Universiteit in Amsterdam.  In the research model
>>> I use a moderating variable; I'm looking for confirmation or help on the
>>> formulation of the model.
>>>
>>> The research model consists of two explanatory variables, a moderating
>>> variable, and one response variable.  The first explanatory variable is
>>> Culture, measured on a nominal scale, and the second is Structure of the
>>> alliance, also measured on a nominal scale.  The moderating variable in
>>> the relation towards Trust is Speed of Integration, measured on an
>>> interval scale.  The response variable is Trust, measured on an ordinal
>>> scale (highly likely a 5-point Likert scale).  Given the research model
>>> and the measurement scales, I intend to use an ordered probit model,
>>> often used in marketing models, for the regression.  I can't seem to
>>> find confirmation on how to write the formula.  I have been reading and
>>> studying R for a couple of weeks now, and have read a lot of books from
>>> the R series in the past, but I can't get a grasp on this seemingly
>>> simple formula.  I think I understand how to fit multinomial regression
>>> (using the R package MNP), how to run a Principal Components Analysis
>>> and an Exploratory Factor Analysis (obviously I'm using a questionnaire
>>> to collect my data), but the formula itself seems to be too simple.
>>>
>>> I expect to use the interaction symbol: "Trust ~ Culture + Structure :
>>> Speed_of_Integration" for the formula, but is it really this simple?
>>> Can anybody confirm this, or help and advise me on this issue?
>>>
>>>
>>> Kind regards,
>>>
>>> --
>>>
>>> With kind regards,
>>>
>>> Johan Haasnoot
>>>
>>> De Haan & Martojo
>>> Kerklaan 5
>>> 3645 ES Vinkeveen
>>> Phone: 0297-266354
>>> Mobile: 06-81827665
>>> Email: johan.haasnoot at dh-m.nl
>>> Website: www.dehaanenmartojo.nl
>>>
>>> * * * * * * * * * * * * * D I S C L A I M E R * * * * * * * * * * * * *
>>> The views expressed in this message and any annex are purely those of the
>>> writer and may not be regarded as stating an official position of INBO, as
>>> long as the message is not confirmed by a duly signed document.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
> --
> Dipl. Psych. Henrik Singmann
> PhD Student
> Albert-Ludwigs-Universität Freiburg, Germany
> http://www.psychologie.uni-freiburg.de/Members/singmann
>
>



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/


