[BioC] limma design matrix
Gordon Smyth
smyth at wehi.edu.au
Mon Mar 1 03:12:41 MET 2004
Dear Michael,
Thanks for your comments on limma and for your email. I definitely agree
with you that constructing the design and contrasts matrices are the
hardest part of using limma - the amount of time I spend advising people on
this makes me very aware of it. You're not the only person to have asked
about factorial models, so I'll try to address some general issues.
Firstly, let me say that the factorial example in the User's Guide is out
of data. The analysis would be rather simpler now using newer features of
the designMatrix() and contrast.fit() functions. I won't go through the
newer analysis here because the Weaver experiment is a direct design while
you have a common reference design - a bit simpler.
Secondly, why doesn't limma allow you to use factorial formula like 'data ~
c*b*t' to construct your design matrix? There are a number of reasons.
a) ANOVA is not as simple as it appears. In you were to use the aov()
function in the data example that you give below, you would not get out the
interactions and F-statistics that you expect. There just isn't enough data
to use the factorial model approach. Even if there was more data,
microarray data is usually unbalanced because of quality weights or missing
data so there won't be a unique anova table.
b) The formula aov() approach doesn't apply without further development to
direct comparison two-color designs or to designs including dye-swaps where
the dye-swap is not a factor to be fitted.
c) The most important reason though relates to the interpretation of the
factorial model. The division of sums of squares into mean effects and
interactions is a generally useful technique but the interactions very
often do not correspond to contrasts of interest in themselves. A factorial
ANOVA analysis usually proceeds in several steps. One first looks at the
interactions to get an idea of the complexity of the data. If there aren't
any interactions, then one can interpret the main effects directly. If
there are, then one has to look at the data in more detail to determine
which specific contrasts are causing the interactions. I can't see how one
can realistically go through this process for each of 20,000 ANOVA tables.
I think that it is best to instead specify the contrasts of interest in
advance. Unfortunately this means that you will need to do something
different for each analysis - it is hard to automate it.
Thirdly, note that if you really do have a formula which corresponds to a
model that you want to fit, you can always use it in limma. Just use
design <- model.matrix( formula )
and then input that design matrix to lmFit().
In your three way factorial example, I really think that you may want to
neglect the interactions between animal and infection & time, i.e., you
want a model like
data ~ animal + infection*time
However you need to know what all the fitted parameters mean, which you
probably won't if you just let aov() or model.matrix() make a design matrix
for you! Here is a suggestion. Suppose you have a targets file like this:
> targets <- readTargets("temp.txt")
> targets
Cy3 Cy5
1 Reference Animal1UninfEarly
2 Reference Animal1UninfLate
3 Reference Animal1InfEarly
4 Reference Animla1InfLate
5 Reference Animal2UninfEarly
6 Reference Animal2UninfLate
7 Reference Animal2InfEarly
8 Reference Animal2InfLate
Now specify what you'd like the parameters in your linear model to mean:
> parameters <- cbind(
+ UninfEarly=c(-1,0.5,0,0,0,0.5,0,0,0),
+ UninfLate=c(-1,0,0.5,0,0,0,0.5,0,0,),
+ InfEarly=c(-1,0,0,0.5,0,0,0,0.5,0),
+ InfLate=c(-1,0,0,0,0.5,0,0,0,0.5,),
+ AnimalDiff=c(0,-0.25,-0.25,-0.25,-0.25,0.25,0.25,0.25,0.25)
+ )
> rownames(parameters) <- c("Reference",targets$Cy5)
> parameters
UninfEarly UninfLate InfEarly InfLate AnimalDiff
Reference -1.0 -1.0 -1.0 -1.0 0.00
Animal1UninfEarly 0.5 0.0 0.0 0.0 -0.25
Animal1UninfLate 0.0 0.5 0.0 0.0 -0.25
Animal1InfEarly 0.0 0.0 0.5 0.0 -0.25
Animla1InfLate 0.0 0.0 0.0 0.5 -0.25
Animal2UninfEarly 0.5 0.0 0.0 0.0 0.25
Animal2UninfLate 0.0 0.5 0.0 0.0 0.25
Animal2InfEarly 0.0 0.0 0.5 0.0 0.25
Animal2InfLate 0.0 0.0 0.0 0.5 0.25
This means you want five coefficients. The first corresponds to the average
log-ratio of uninfected animals at the early time versus the common
reference. The second is the average log-ratio of uninfected animals at the
later time, etc. The fifth coefficient represents the difference between
your two animals, a nuisance parameter probably. There are other ways to do
this, the coefficients I've chosen are only one choice.
Now compute the design matrix. You don't need to try to understand the
detailed entries in the design matrix to interpret the coefficients.
> design <- designMatrix(targets,parameters)
Found unique target names:
Reference Animal1UninfEarly Animal1UninfLate Animal1InfEarly
Animla1InfLate Animal2UninfEarly Animal2UninfLate Animal2InfEarly
Animal2InfLate
Warning message:
number of parameters should be one less than number of targets in:
designMatrix(targets, parameters)
> design
UninfEarly UninfLate InfEarly InfLate AnimalDiff
1 1 0 0 0 -0.5
2 0 1 0 0 -0.5
3 0 0 1 0 -0.5
4 0 0 0 1 -0.5
5 1 0 0 0 0.5
6 0 1 0 0 0.5
7 0 0 1 0 0.5
8 0 0 0 1 0.5
> fit <- lmFit(MA, design)
Now you can start asking quesions. For example, what is the effect of
infection at the early time? What is the effect of infection at the late
time? Does the effect of infection change between the late and early times?
(This is usually called interaction.)
cont.matrix <-
cbind(InfvsUninfEarly=c(-1,0,1,0,0),InfvsUninfLate=c(0,-1,0,1,0),EarlyvsLateEffect=c(1,-1,-1,1,0))
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
Note that you need a recent version of limma for this to work.
Regards
Gordon
At 03:42 AM 28/02/2004, michael watson (IAH-C) wrote:
>Hi
>
>First off, let me say that i think limma is a quite brilliant package and
>I use it a lot. However, one of the biggest obstructions to using limma
>for the lay biologist is an inability to understand the design
>matrix. Although there is a lot of documentation, showing the design
>matrix for a number of example problems, there is no discussion as to how
>that design matrix was constructed i.e. the logical thought processes that
>went into it.
>
>For the two-sample experiment given in the UserGuide, I understand that
>there must be one row per array in my design matrix and the columns
>represent the coefficients I want to calculate. I represent the
>differences in the factors with 1's and 0's. Great, this is pretty
>similar to how I do it for aov().
>
>But then, all of a sudden, for the factorial experiment the design matrix
>not only has -1's in there, but also a column for the interactions. How
>do I decide which array/factor combination gets a 1, a 0 or a -1?
>
>Let me put this in perspective. I have a 3 factor experiment where the
>factors are animal, infected/uninfected and time. All samples were
>hybridised against a common reference. For analysis of variance, all I do
>is set up a data.frame that looks like this for each gene:
>
> data c b t
>1 2.9 1 1 1
>2 2.7 1 0 2
>3 2.8 1 1 1
>4 3.0 1 0 2
>5 -3.0 0 1 1
>6 -3.5 0 0 2
>7 -4.0 0 1 1
>8 -5.0 0 0 2
>
>where data is my data, and c, b and t are my factors, and then feed in
>something like:
>
>(aov.aov <- aov(data ~ c*b*t, aov.data))
>
>and I get F-statistics for c, b and t and all possible interactions.
>
>Because of the limitations of analysis of variance for my microarray data,
>I would like to use limma. Is there any *more* documentation I can look
>at that will tell me the steps to take to work out what my limma design
>matrix will look like?
>
>Kind regards
>
>Michael
More information about the Bioconductor
mailing list