# [Bioc-devel] Making hypothesis testing easier with design matrices?

Ryan C. Thompson rct at thompsonclan.org
Mon Dec 10 11:12:44 CET 2012

```
Hi Gordon and list,

I've been thinking about how to make it easier to specify what
hypotheses one wants to test in microarray or RNA-seq differential
expression data sets, and I think one of the major stumbling blocks
for people is the way in which design matrices must have one
coefficient "missing" for each term. So if you have several experimental
factors and blocking factors, you can't have a column of the design
matrix corresponding to every level of every factor in the formula. But
I believe every level of every factor could be represented as a contrast
of one or more coefficients in the design matrix. For example, if you
had a variable "cond" with 3 levels "A", "B", and "C", and you did
"model.matrix(~1+cond)", you get a design matrix with an intercept,
a B-A term, and a C-A term, with names "(Intercept)", "condB", and
"condC". From this, you could solve for the contrast representing each
of A, B and C. For example, since the intercept is the mean of the
reference level, A is simply "(Intercept)", B is "(Intercept) + condB",
and C is "(Intercept) + condC". Expressed as contrast vectors in R,
these would be "c(1, 0, 0)", "c(1, 1, 0)", and "c(1, 0, 1)". (Of
course, for this trivial example one can just do
"model.matrix(~0+cond)", but that doesn't work for all the factors in a
multi-factor design.)

So, in the same step as the design matrix is created, the function could
also return, regardless of how the model formula was parametrized, a
matrix where each column is the contrast corresponding to one level of
one of the factors in the model formula. (This could be added as an
attribute on the design matrix, for example.) The user could then add
and subtract these columns (perhaps with a helper function similar to
makeContrasts that allows it to be done symbolically) to get the
contrasts that they want without having to worry about exactly how the
contrasts are coded into the design matrix. Obviously, for multi-factor
designs, this matrix of factor levels coded as contrasts would have more
columns than the design matrix itself. For example, if an experiment has
a 2-level factor and a 3-level factor, then the design matrix would have
4 columns, but the "available factor level matrix" would have 5 columns.

The advantage of such a scheme would be that the computer can tell the
user, in addition to the coefficients in the design matrix, "here are
the available factor levels that you can perform comparisons on", and
the user could pick the ones they are interested in and add/subtract
them to get the test they want.

What do you think of this idea? Could it work in practice for limma and
edgeR? I would be interested in writing code to make it a reality if you
thought it was worthwhile.

Sincerely,
-Ryan Thompson

```
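
The proposal above can be sketched in a few lines of R. This is only an illustration under stated assumptions, not an existing limma or edgeR feature: `levelContrasts()` is a hypothetical helper name, the `samples` data frame is made up, and every variable in the formula is assumed to be a factor in a model without interaction terms. Each factor level is expressed as the average of the design-matrix rows over a balanced grid of all level combinations, which is one possible way to define the "available factor level matrix".

```r
# Sketch only: levelContrasts() is a hypothetical helper, not a limma or
# edgeR function. It assumes every variable in the formula is a factor
# and has only been checked for additive (no-interaction) models.
levelContrasts <- function(formula, data) {
  factor_names <- all.vars(formula)
  # Balanced grid containing every combination of factor levels once.
  grid <- expand.grid(lapply(data[factor_names], levels))
  grid_design <- model.matrix(formula, data = grid)
  cols <- list()
  for (f in factor_names) {
    for (lev in levels(data[[f]])) {
      # Average the design rows in which factor f is at level lev.
      rows <- grid[[f]] == lev
      cols[[paste0(f, lev)]] <- colMeans(grid_design[rows, , drop = FALSE])
    }
  }
  # One column per factor level, one row per design-matrix coefficient.
  do.call(cbind, cols)
}

# Made-up experiment: a 2-level blocking factor and a 3-level condition.
samples <- data.frame(
  batch = factor(rep(c("x", "y"), times = 3)),
  cond  = factor(rep(c("A", "B", "C"), each = 2))
)
design <- model.matrix(~ batch + cond, data = samples)
level_matrix <- levelContrasts(~ batch + cond, samples)

ncol(design)        # number of coefficients in the design matrix
ncol(level_matrix)  # number of factor levels available for comparison

# B versus A, built without knowing how the design is parametrized.
b_vs_a <- level_matrix[, "condB"] - level_matrix[, "condA"]
```

In this sketch the individual level columns depend on the choice to average over the other factor, but a difference between two levels of the same factor (such as `b_vs_a`) does not, because the averaged terms cancel. The resulting vector has one entry per design-matrix coefficient, so it has the shape that limma's `contrasts.fit()` expects for its `contrasts` argument.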