[Bioc-devel] Making hypothesis testing easier with design matrices?

Gordon K Smyth smyth at wehi.EDU.AU
Tue Dec 11 02:56:07 CET 2012

Dear Ryan,

Thanks for your suggestion.  I think though that the attribute that you 
are thinking of implementing is not actually something that exists in 

On Mon, 10 Dec 2012, Ryan C. Thompson wrote:

> Hi Gordon and list,
> I've been thinking about how to make it easier to specify what hypotheses one 
> wants to test in microarray or RNA-seq differential expression data sets, and 
> I think one of the major stumbling blocks that confuses people is the way in 
> which design matrices must have one coefficient "missing" for each term. So 
> if you have several experimental factors and blocking factors, you can't have 
> a column of the design matrix corresponding to every level of every factor in 
> the formula.

I hope you understand why this has to be so.

> But I believe every level of every factor could be represented as a 
> contrast of one or more coefficients in the design matrix.

This is so only for one-way designs, i.e., for single factor experiments.

> For example, if you had a variable "cond" with 3 levels "A", "B", and 
> "C", and you did "model.matrix(~1+condition)",

I think you mean model.matrix(~cond)

> you get a design matrix with an intercept, a B-A term, and a C-A term, 
> with names "Intercept", "condB", and "condC".

Actually the name of the first coefficient would be "(Intercept)".

> this, you could solve for the contrast representing each of A, B and C. 
> For example, I believe A is "Intercept - (condB + condC)/3".

No.  I'm not sure what you have in mind by "A" but, if you mean the 
average log-expression when the condition is at level A, then it would be 
just "(Intercept)".

The contrast you give doesn't represent anything of interest.

> Expressed as a contrast vector in R, this would be "c(1, -1/3, -1/3)". 
> (Of course, for this trivial example one can just do 
> "model.matrix(~0+cond)", but that doesn't work for all the factors in a 
> multi-factor design.)
> So, in the same step as the design matrix is created, the function could also 
> return, regardless of how the model formula was parametrized, a matrix where 
> each column is the contrast corresponding to one level of one of the factors 
> in the model formula.

No, there isn't any such thing as a contrast corresponding to one level of 
one of the factors.  For example, consider the additive model ~a+b where 
"a" and "b" are factors with levels 1,2,...  Then it just doesn't make 
sense to talk about the effect for b1 as an isolated concept.

The model has fitted values, which require the levels of both factors to 
be specified simultaneously, and it is possible to make tests for 
combinations of coefficients.  There isn't anything else that is 

Suppose that "b" has three levels.  You might not be appreciating the fact 
that the two coefficients b2 and b3 do contain all the information about 
comparisons between levels of b.  The intercept term is not involved. 
You can test for b2, for b3, or for b2-b3.  In terms of pairwise 
comparisons, there's nothing else.

> (This could be added as an attribute on the design matrix, for example.) 
> The user could then add and subtract these columns (perhaps with a 
> helper function similar to makeContrasts that allows it to be done 
> symbolically) to get the contrasts that they want without having to 
> worry about exactly how the contrasts are coded into the design matrix. 
> Obviously, for multi-factor designs, this matrix of factor levels coded 
> as contrasts would have more columns than the design matrix itself. For 
> example, if an experiment has a 2-level factor and a 3-level factor, 
> then the design matrix would have 4 columns, but the "available factor 
> level matrix" would have 5 columns.

Another thing to consider is interactions. For models including 
interactions, then making comparisons between levels of one factor in 
isolation is not scientifically sensible.

In the edgeR and limma User's Guide, I recommend that most users may find 
it simplest to think of interaction models in terms of one-way layouts.

For additive models however, I think there is no shortcut to a user trying 
to understand what the fitted coefficients represent.

Best wishes

> The advantage of such a scheme would be that the computer can tell the 
> user in addition to the coefficients in the design matrix, "here are the 
> available factor levels that you can perform comparisons on", and the 
> user could pick the ones they are interested in and and add/subtract 
> them to get the test they want.
> What do you think of this idea? Could it work in practice for limma and 
> edgeR? I would be interested in writing code to make it a reality if you 
> thought it was worthwhile.
> Sincerely,
> -Ryan Thompson

The information in this email is confidential and intend...{{dropped:4}}

More information about the Bioc-devel mailing list