[BioC] design and contrast matrix for limma time series without replicates

Tue Mar 20 19:51:19 CET 2012

Hi Xuan,

On 3/20/2012 2:13 PM, shao chunxuan wrote:
> Hi Jim,
>
> Thanks for the reply,  but I still  have few more questions on the 
> regression model of limma.
>
> Time points are not evenly spreaded, the real experiments  like this:
>
> xx.1 <- c(0.5,1,3,6,12,24,72)
> time <- rep(xx.1,2)
>
> The design matrix is
> trt <- factor(rep(0:1, each = 7))
> design=model.matrix(~time+trt)
> colnames(design) <- c("Intercept","time","treat")
>
> > design
>    Intercept time treat
> 1          1  0.5     0
> 2          1  1.0     0
> 3          1  3.0     0
> 4          1  6.0     0
> 5          1 12.0     0
> 6          1 24.0     0
> 7          1 72.0     0
> 8          1  0.5     1
> 9          1  1.0     1
> 10         1  3.0     1
> 11         1  6.0     1
> 12         1 12.0     1
> 13         1 24.0     1
> 14         1 72.0     1
> attr(,"assign")
> [1] 0 1 2
> attr(,"contrasts")
> attr(,"contrasts")$trt
> [1] "contr.treatment"
>
> Here is my understanding of what happens in limma
>
> In this case, Iimma is supposed to fit:
> y=p0 + p1*time + p2*treat, and y=p0 + p1*time,
> while time is quantitative variable, treat is group variable.
> A F-statistic will be calculated and the corresponding pvalue will be 
> used to determine significance.
>
> Is it right to understand in this way?
>
> The implementation is :
> fit <- lmFit(ts.data, design) ## ts.data is the time-course data.
> fit <- eBayes(fit)
> xx.1 <- topTable(fit, coef="treat", adjust="BH")
> xx.2 <- topTable(fit, coef="time", adjust="BH")
>
> So xx.1  means the differentially expressed genes between the two 
> treatments, after adjusting for the time effect, right? what does xx.2 
> mean?

xx.2 gives you those genes where the slope of the fitted line is 
significantly different from zero. In other words, these are the genes 
that appear to go up or down as time progresses.

You should note two things here. First, since you don't have replicates, 
we are forced to use the time points as continuous covariates rather 
than factor levels. This means that you are looking for a linear 
response to time (e.g., genes that go up or down more as time 
progresses). This is not normally what one would want to do, as it is 
more likely that a gene will go up for a bit and then go back down (or 
vice versa). To account for that curvature in the expression, you might 
want to add a time^2 covariate as well.

Second, linear regression is in some ways more complicated than analysis 
of variance, and if you were just fitting a single model you would be 
able to assess how well your model was conforming to the assumptions of 
regression, test to see if you need a quadratic time term, look for data 
with high leverage that might be skewing your results, etc. For ANOVA, 
you are in general just testing to see if the mean of the groups are 
different.

When doing microarray analysis, you just fit what you hope is a 
generally reasonable model and let it rip on thousands of genes at once. 
There is no way to check all the little details of a linear regression 
on thousands of models, so you may (will?) have numerous genes that are 
clearly mis-specified by the model you fit, or have glaring 
inconsistencies that cause them to appear significant when it is just 
that the model is really wrong.

So that is my long winded way of saying that you are not IMO analyzing 
your data in an optimal way, but you are forced to do so by the lack of 
replication. So for each 'significant' gene you find, you should look at 
plots of expression vs time prior to attempting validation.

> I am more interested in finding genes changed in time course in 
> treatment, after adjusting for the control.
> Is tm*trt in  design2 what I need?

Not really. The model you are fitting here (xx.2) gives the genes that 
change linearly with time, after adjusting for control. The assumptions 
you are making are that

a) Genes go up or down in a linear fashion (e.g., you can plot 
expression values on the vertical axis and time on the horizontal axis 
and then draw a line over the data, and it looks 'reasonable').
b) The only difference between control and treatment is a difference in 
the intercept, so you assume the same slope for control and treated samples.

adding a tm:trt1 coefficient allows the slopes to be different as well. 
This coefficient tests for a significant difference between slopes (for 
example, it might be that geneX goes up in treated, but down in control).

Best,

Jim

>
> Best,
>
>
> On Tue, Mar 20, 2012 at 2:48 PM, James W. MacDonald <jmacdon at uw.edu 
> <mailto:jmacdon at uw.edu>> wrote:
>
>     Hi Xuan,
>
>
>     On 3/20/2012 6:50 AM, Chunxuan Shao wrote:
>
>         Hi everyone:
>
>         I have one microarray data set considering differentiation in
>         a cell line. About 7 time points for both control and
>         treatment, no replicates
>         I would like to use limma to find the differentially expressed
>         genes between time points for control and treatment, and want
>         to compare the gene expression between control and treatment
>         for the same time point. But I don't know how to make the
>         right design and contrast matrix.
>
>         After searching mail archive, the closest related answer is
>         "https://stat.ethz.ch/pipermail/bioconductor/2010-June/033849.html",
>         which suggest:
>
>         "
>         time=1:10
>         design=model.matrix(~time)
>         "
>
>         In my case, is it correct to set this?
>
>         time=rep(1:7,2)
>         design=model.matrix(~time)
>
>
>     Probably not. This implies that your time points are equally
>     spaced, with one time period between them (e.g., you collected
>     samples after 1,2,3,4,5,6,7 hours or days or some other period).
>     If you didn't use equally spaced time points, you need to change
>     to reflect that.
>
>     In addition, you need to set the design matrix up to include the
>     control/treatment comparison. If I assume you are in fact using
>     seven equally spaced time intervals, then you would want:
>
>     tm <- rep(1:7,2)
>     trt <- factor(rep(0:1, each = 7))
>
>     design <- model.matrix(~tm+trt)
>
>     And the trt1 coefficient estimates the difference in the
>     intercepts, which is what you are looking for. In other words,
>     this is fitting a model where you are allowing the two sample
>     types to have different intercepts, but assuming the same slope.
>     If the intercepts are different, then one sample type has overall
>     higher expression than the other.
>
>     You could also allow for different slopes by adding a
>     time/treatment interaction term:
>
>     design2 <- model.matrix(~tm*trt)
>
>     Here the main coefficient of interest would be tm:trt1, which
>     measures the difference in slope between the treatment and control.
>
>     Best,
>
>     Jim
>
>
>
>             design
>
>            (Intercept) time
>         1            1    1
>         2            1    2
>         3            1    3
>         4            1    4
>         5            1    5
>         6            1    6
>         7            1    7
>         8            1    1
>         9            1    2
>         10           1    3
>         11           1    4
>         12           1    5
>         13           1    6
>         14           1    7
>         attr(,"assign")
>         [1] 0 1
>
>
>         Then how to set the contrast matrix?
>
>
>         Thanks!
>
>         --
>         xuan
>
>
>
>                [[alternative HTML version deleted]]
>
>         _______________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>         Search the archives:
>         http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>     -- 
>     James W. MacDonald, M.S.
>     Biostatistician
>     University of Washington
>     Environmental and Occupational Health Sciences
>     4225 Roosevelt Way NE, # 100
>     Seattle WA 98105-6099
>
>
>
>
> -- 
> xuan

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099