[Bioc-devel] Making hypothesis testing easier with design matrices?

Mon Dec 10 11:12:44 CET 2012

Hi Gordon and list,

I've been thinking about how to make it easier to specify what 
hypotheses one wants to test in microarray or RNA-seq differential 
expression data sets, and I think one of the major stumbling blocks that 
confuses people is the way in which design matrices must have one 
coefficient "missing" for each term. So if you have several experimental 
factors and blocking factors, you can't have a column of the design 
matrix corresponding to every level of every factor in the formula. But 
I believe every level of every factor could be represented as a contrast 
of one or more coefficients in the design matrix. For example, if you 
had a variable "cond" with 3 levels "A", "B", and "C", and you did 
"model.matrix(~1+cond)", you would get a design matrix with an 
intercept, a B-A term, and a C-A term, with names "(Intercept)", 
"condB", and "condC". From this, you could solve for the contrast 
representing each of A, B and C. With R's default treatment coding, A 
is just "(Intercept)", B is "(Intercept) + condB", and C is 
"(Intercept) + condC". Expressed as contrast vectors in R, these would 
be "c(1, 0, 0)", "c(1, 1, 0)", and "c(1, 0, 1)". (Of course, for this 
trivial example one can just do 
"model.matrix(~0+cond)", but that doesn't work for all the factors in a 
multi-factor design.)
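
A minimal sketch of the single-factor case in R, assuming the default 
treatment coding and a made-up layout of two samples per level. The 
names "level_contrasts" and "B_vs_A" are only illustrative. Each level's 
contrast vector is read off the design matrix row of a sample at that 
level:

```r
# Hypothetical 3-level factor with two samples per level (assumed layout)
cond <- factor(rep(c("A", "B", "C"), each = 2))
design <- model.matrix(~1 + cond)

# Under treatment coding, the design row of a sample at a given level is
# the contrast vector that recovers that level's mean from the coefficients.
first_of_each <- !duplicated(cond)
level_contrasts <- t(design[first_of_each, ])
colnames(level_contrasts) <- as.character(cond[first_of_each])

# Rows are design matrix coefficients, columns are factor levels
print(level_contrasts)

# Comparisons are then plain differences between level columns
B_vs_A <- level_contrasts[, "B"] - level_contrasts[, "A"]
print(B_vs_A)
```

Here "B_vs_A" picks out only the "condB" coefficient, which matches the 
B-A interpretation of that column.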

So, in the same step as the design matrix is created, the function could 
also return, regardless of how the model formula was parametrized, a 
matrix where each column is the contrast corresponding to one level of 
one of the factors in the model formula. (This could be added as an 
attribute on the design matrix, for example.) The user could then add 
and subtract these columns (perhaps with a helper function similar to 
makeContrasts that allows it to be done symbolically) to get the 
contrasts that they want without having to worry about exactly how the 
contrasts are coded into the design matrix. Obviously, for multi-factor 
designs, this matrix of factor levels coded as contrasts would have more 
columns than the design matrix itself. For example, if an experiment has 
a 2-level factor and a 3-level factor and the model is additive, then 
the design matrix would have 4 columns, but the "available factor level 
matrix" would have 5 columns.
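
A rough sketch of what such a matrix might look like for the additive 
two-factor case, in R. The function name "level_matrix", the factor 
names "batch" and "cond", and the choice to define each level as its 
mean averaged with equal weights over the levels of the other factor 
are all assumptions made for illustration, not an existing limma or 
edgeR feature:

```r
# Hypothetical helper: one column per factor level, expressed as a
# contrast of the design matrix coefficients. Each level is ASSUMED to
# mean "average over all combinations of the other factors' levels".
level_matrix <- function(formula, grid) {
  X <- model.matrix(formula, data = grid)
  cols <- list()
  for (f in names(grid)) {
    for (lev in levels(grid[[f]])) {
      rows <- grid[[f]] == lev
      cols[[paste(f, lev, sep = ".")]] <- colMeans(X[rows, , drop = FALSE])
    }
  }
  do.call(cbind, cols)
}

# All combinations of a 2-level factor and a 3-level factor
grid <- expand.grid(batch = factor(c("b1", "b2")),
                    cond  = factor(c("A", "B", "C")))
L <- level_matrix(~ batch + cond, grid)

# 4 coefficients (rows) by 5 factor levels (columns)
print(dim(L))

# Comparisons within a factor are differences of level columns
B_vs_A <- L[, "cond.B"] - L[, "cond.A"]
print(B_vs_A)

# Symbolic use, loosely in the spirit of makeContrasts
C_vs_B <- with(as.data.frame(L), cond.C - cond.B)
```

The individual columns depend on the averaging assumption, but 
differences between levels of the same factor, such as "B_vs_A", do not 
in this additive model.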

The advantage of such a scheme would be that the computer can tell the 
user, in addition to the coefficients in the design matrix, "here are 
the available factor levels that you can perform comparisons on", and 
the user could pick the ones they are interested in and add/subtract 
them to get the test they want.

What do you think of this idea? Could it work in practice for limma and 
edgeR? I would be interested in writing code to make it a reality if you 
thought it was worthwhile.

Sincerely,
-Ryan Thompson