[BioC] Finding differentially expressed genes using Ebayes
James W. MacDonald
jmacdon at med.umich.edu
Tue May 12 15:22:01 CEST 2009
Hi John,
Please don't take things off-list. What I said yesterday was a mistake,
but most answers are hoped to be reasonable, and thus the archives are
considered to be a knowledge repository that people can search.
What I meant to say was
design <- model.matrix(~factor(rep(1:2, each=3), levels=2:1))
which will give you this
> design
(Intercept) factor(rep(1:2, each = 3), levels = 2:1)1
1 1 1
2 1 1
3 1 1
4 1 0
5 1 0
6 1 0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$`factor(rep(1:2, each = 3), levels = 2:1)`
[1] "contr.treatment"
And the parameterization will now be
Intercept = control
factor1 = cancer - control
You could also add
colnames(design) <- c("control","cancer - control")
to make it more readable.
As for a layman's explanation, that's sort of tough, especially when
trying to explain algebra using ascii text with no subscripts (the
convention is to use the underscore, so y_1 means y subscript i). But
here goes...
With this design matrix you are trying to simultaneously solve two
equations:
y_1 = control + (cancer - control) => y_1 = cancer
y_4 = control
Since we have replicates we are actually computing the mean of the
cancer and control samples, so there is an error term that captures the
difference between the mean and the actual value.
So really what we have are
y_1 = mean(cancer samples) + error
y_4 = mean(control samples) + error
Where the error term is just the difference between the observed
expression value for that sample and the mean for that sample type, so
for one gene there are six observed (y) terms, two mean values (cancer
and control) and six error terms.
You could write this all out as one function like this:
y_i = control*X1_i + (cancer - control)*X2_i + error_i
Which is where the design matrix comes in. The design matrix gives you
the X1 and X2 values. So if i = 1, then you have
y_1 = control*1 + (cancer - control)*1 + error_1 =>
y_1 = cancer + error_1
if i = 4 (the fourth row of the design matrix) you have
y_4 = control*1 + (cancer - control)*0 + error_4 =>
y_4 = control + error_4
Does that help?
Best,
Jim
John Herbert wrote:
> Hello James.
> Thanks for helping me out yesterday. I like to understand what's going on and I am struggling a bit. So your line of code you suggest does not work for me.
>
> fit <- lmFit(~factor(rep(1:2, each = 3), levels=2:1)
>
> It looks like it is missing a ")" bracket, and is it missing the data?
>
> It would be great if you could explain me the design matrix in lamen terms? I don't seems to get what all the 1s and 0s represent but I am reading about it.
>
> Thanks a lot for any help.
>
> Kind regards,
>
> John.
>
> =================================
> Bioinformatics Officer
>
> Molecular Angiogenesis group
> First Floor, Institute of Biomedical Research
> The University of Birmingham
> Medical School
> Edgbaston
> Birmingham
> B15 2TT (UK)
>
> j.m.herbert at bham.ac.uk
>
> Tel: +44 121 414 3733
> Fax: +44 121 415 8677
>
>
>
> -----Original Message-----
> From: James W. MacDonald [mailto:jmacdon at med.umich.edu]
> Sent: Mon 11/05/2009 14:34
> To: John Herbert
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] Finding differentially expressed genes using Ebayes
>
> Hi John,
>
> John Herbert wrote:
>> Hello.
>> I have 6 gpr files, 3 from cancer and 3 from normals. I have normalised them with printtiplowess and removed the control spots. My question is about finding differentially expressed genes with Ebayes. I have run the code below and get some strange results that are very different to using Samr. I am thinking I trust samr more as when I look at the results by eye, they seem to order the genes coffectly based on fold ratio. However, it could be that I am not using ebayes correctly.
>>
>> So, the design is supposed to represent cancer (the 1s) and normal (the 2s). Is this the correct way to construct a design matrix? I want to compare gene expression between cancer and normal (3 replicates each).
>>
>> design <- model.matrix(~factor(c(1,1,1,2,2,2)));
>>
>> fit <- lmFit(probes_only.imp, design);
>>
>> fit <- eBayes(fit);
>>
>> probes_only.imp_50_ebayes <- topTable(fit, n=20);
>
> The model you are fitting has an intercept, which in this case measures
> the average expression of the cancer genes. So your topTable will just
> give those genes that have an average expression different from zero (an
> uninteresting result).
>
> If you do topTable(fit, coef = 2), you will get the genes that are
> different between cancer and normal, but the direction might be
> different than what you expect (the second coefficient is the difference
> between normal and cancer, e.g., normal - cancer), so a negative result
> indicates upregulation in cancer. This might be a bit counterintuitive,
> so you could simply replace your call to lmFit() with
>
> fit <- lmFit(~factor(rep(1:2, each = 3), levels=2:1)
>
> which will re-order things so coefficient 2 will be cancer - normal.
>
> Best,
>
> Jim
>
>
>> Any help appreciated on this, thank you.
>>
>> Kind regards,
>>
>> John.
>>
>>
>> =================================
>> Bioinformatics Officer
>>
>> Molecular Angiogenesis group
>> First Floor, Institute of Biomedical Research
>> The University of Birmingham
>> Medical School
>> Edgbaston
>> Birmingham
>> B15 2TT (UK)
>>
>> j.m.herbert at bham.ac.uk
>>
>> Tel: +44 121 414 3733
>> Fax: +44 121 415 8677
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
More information about the Bioconductor
mailing list