[BioC] Finding differentially expressed genes using Ebayes

James W. MacDonald jmacdon at med.umich.edu
Tue May 12 15:22:01 CEST 2009


Hi John,

Please don't take things off-list. What I said yesterday was a mistake, 
but most answers are hoped to be reasonable, and thus the archives are 
considered to be a knowledge repository that people can search.

What I meant to say was

design <- model.matrix(~factor(rep(1:2, each=3), levels=2:1))

which will give you this

 > design
   (Intercept) factor(rep(1:2, each = 3), levels = 2:1)1
1           1                                         1
2           1                                         1
3           1                                         1
4           1                                         0
5           1                                         0
6           1                                         0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$`factor(rep(1:2, each = 3), levels = 2:1)`
[1] "contr.treatment"


And the parameterization will now be

Intercept = control
factor1 = cancer - control

You could also add

colnames(design) <- c("control","cancer - control")

to make it more readable.

As for a layman's explanation, that's sort of tough, especially when 
trying to explain algebra using ascii text with no subscripts (the 
convention is to use the underscore, so y_1 means y subscript i). But 
here goes...

With this design matrix you are trying to simultaneously solve two 
equations:

y_1 = control + (cancer - control) => y_1 = cancer
y_4 = control

Since we have replicates we are actually computing the mean of the 
cancer and control samples, so there is an error term that captures the 
difference between the mean and the actual value.

So really what we have are

y_1 = mean(cancer samples) + error
y_4 = mean(control samples) + error

Where the error term is just the difference between the observed 
expression value for that sample and the mean for that sample type, so 
for one gene there are six observed (y) terms, two mean values (cancer 
and control) and six error terms.

You could write this all out as one function like this:

y_i = control*X1_i + (cancer - control)*X2_i + error_i

Which is where the design matrix comes in. The design matrix gives you 
the X1 and X2 values. So if i = 1, then you have

y_1 = control*1 + (cancer - control)*1 + error_1 =>
y_1 = cancer + error_1

if i = 4 (the fourth row of the design matrix) you have

y_4 = control*1 + (cancer - control)*0 + error_4 =>
y_4 = control + error_4

Does that help?

Best,

Jim



John Herbert wrote:
> Hello James.
> Thanks for helping me out yesterday. I like to understand what's going on and I am struggling a bit. So your line of code you suggest does not work for me. 
> 
> fit <- lmFit(~factor(rep(1:2, each = 3), levels=2:1)
> 
> It looks like it is missing a ")" bracket, and is it missing the data?
> 
> It would be great if you could explain me the design matrix in lamen terms? I don't seems to get what all the 1s and 0s represent but I am reading about it. 
> 
> Thanks a lot for any help. 
> 
> Kind regards,
> 
> John.
> 
> =================================
> Bioinformatics Officer
> 
> Molecular Angiogenesis group
> First Floor, Institute of Biomedical Research
> The University of Birmingham
> Medical School
> Edgbaston
> Birmingham
> B15 2TT  (UK)
> 
> j.m.herbert at bham.ac.uk
> 
> Tel:  +44 121 414 3733
> Fax: +44 121 415 8677 
> 
> 
> 
> -----Original Message-----
> From: James W. MacDonald [mailto:jmacdon at med.umich.edu]
> Sent: Mon 11/05/2009 14:34
> To: John Herbert
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] Finding differentially expressed genes using Ebayes
>  
> Hi John,
> 
> John Herbert wrote:
>> Hello.
>> I have 6 gpr files, 3 from cancer and 3 from normals. I have normalised them with printtiplowess and removed the control spots. My question is about finding differentially expressed genes with Ebayes. I have run the code below and get some strange results that are very different to using Samr. I am thinking I trust samr more as when I look at the results by eye, they seem to order the genes coffectly based on fold ratio. However, it could be that I am not using ebayes correctly. 
>>
>> So, the design is supposed to represent cancer (the 1s) and normal (the 2s). Is this the correct way to construct a design matrix? I want to compare gene expression between cancer and normal (3 replicates each).  
>>
>> design <- model.matrix(~factor(c(1,1,1,2,2,2)));
>>
>> fit <- lmFit(probes_only.imp, design);
>>
>> fit <- eBayes(fit);
>>
>> probes_only.imp_50_ebayes <- topTable(fit, n=20);
> 
> The model you are fitting has an intercept, which in this case measures 
> the average expression of the cancer genes. So your topTable will just 
> give those genes that have an average expression different from zero (an 
> uninteresting result).
> 
> If you do topTable(fit, coef = 2), you will get the genes that are 
> different between cancer and normal, but the direction might be 
> different than what you expect (the second coefficient is the difference 
> between normal and cancer, e.g., normal - cancer), so a negative result 
> indicates upregulation in cancer. This might be a bit counterintuitive, 
> so you could simply replace your call to lmFit() with
> 
> fit <- lmFit(~factor(rep(1:2, each = 3), levels=2:1)
> 
> which will re-order things so coefficient 2 will be cancer - normal.
> 
> Best,
> 
> Jim
> 
> 
>> Any help appreciated on this, thank you. 
>>
>> Kind regards,
>>
>> John.
>>
>>
>> =================================
>> Bioinformatics Officer
>>
>> Molecular Angiogenesis group
>> First Floor, Institute of Biomedical Research
>> The University of Birmingham
>> Medical School
>> Edgbaston
>> Birmingham
>> B15 2TT  (UK)
>>
>> j.m.herbert at bham.ac.uk
>>
>> Tel:  +44 121 414 3733
>> Fax: +44 121 415 8677 
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826



More information about the Bioconductor mailing list