[BioC] edgeR on differentially expressed genes with low read counts
Gordon K Smyth
smyth at wehi.EDU.AU
Wed Nov 30 00:36:35 CET 2011
Dear Luca,
The edgeR function glmLRT() conducts a likelihood ratio test, a standard
procedure for generalized linear models. It is virtually impossible to
get the p-value that you give from the counts that you give, regardless of
the dispersion estimation, although the p-value does depend on the total
library sizes, which I can't determine from your email. So I would guess
that there is a code error somewhere in the analysis that you've used,
perhaps the different data tables have got out of sync during the analysis
somehow. I can see that the design matrix doesn't seem to agree with the
columns of the data matrix. For example, patient CT61 seems to be
associated with columns one, two and seven of your data, but is associated
with libraries one, four and six in your design matrix. There may be
other issues as well.
As Naomi has mentioned, I would recommend that any gene with very low
counts be filtered from the data table before analysis, because such genes
can never legitimately be significantly differentially expressed. For
your experiment, I would probably choose to keep genes with at least 1
count-per-million in at least three libraries. See the edgeR User's Guide
for more information about this. However this doesn't seem to your main
problem.
If you are still having a problem and need further advice, please give
complete uninterrupted code that takes you from the original data table to
topTags results.
Best wishes
Gordon
------------------- original message ----------------------
Luca [guest] guest at bioconductor.org
Tue Nov 29 14:54:51 CET 2011
Hi,
I am using the package edgeR to determine differentially expressed genes
in RNA-seq data. I have 8 samples, corresponding to 3 patients (2
conditions per patient and 2 patients duplicated). I performed the
contrast of these two conditions, following the user guide (section 11
Case study: Oral carcinomas vs matched normal tissue). When analyzing some
border line differentially expressed genes (FDR ~ 0.02) I found that in
some cases, the read counts was really low, e.g. only one sample with 2
reads, and the others (7) with 0 counts.
>
countsTable[rownames(countsTable)=="ENSG00000207696",grep("CT",colnames(countsTable))]
61_CT.poli 61_CT.tot 67_CT.poli 67_CT.tot 70_CT.poli
70_CT.tot 61m2_CT.tot 67m2_CT.tot
ENSG00000207696 0 0 0 0 2
0 0 0
The design matrix used was:
> design
patientsCT61 patientsCT67 patientsCT70 as.factor(1 -
(as.numeric(groupsCT) - 1))1
1 1 0 0
0
2 0 1 0
0
3 0 0 1
0
4 1 0 0
0
5 0 1 0
0
6 1 0 0
1
7 0 1 0
1
8 0 0 1
1
The result for the gene in question was:
> tab.DGEensCT[rownames(tab.DGEensCT)=="ENSG00000207696",]
table.logConc table.logFC table.PValue table.FDR
adjust.method comparison
ENSG00000207696 -18.34 7.726 0.005222 0.03941
BH as.factor(1 - (as.numeric(groupsCT) - 1))1
When investigating the result of the gene, I found that the glm-fitted
values looked like this:
>
format((glmfit.DGEensCT$fitted.values)[rownames(glmfit.DGEensCT$fitted.values)=="ENSG00000207696",],scientific=FALSE,digits=4)
61_CT.tot 67_CT.tot 70_CT.tot 61m2_CT.tot 67m2_CT.tot
61_CT.poli 67_CT.poli 70_CT.poli
"0.00008988" "0.00004048" "0.06229493" "0.00026943" "0.00014959"
"0.03479503" "0.03522840" "1.84247667"
Since I was not able to find the details used in edgeR to calculate the
model easily, I was wondering if from this glm-fitted values is it
reasonable to obtain such a low FDR?
I used the common dispersion for the variance estimate:
DGEensCT.CR<-estimateCRDisp(DGEensCT,design)
glmfit.DGEensCT <- glmFit(DGEensCT, design,dispersion =
DGEensCT.CR$CR.common.dispersion)
lrt.DGEensCT <- glmLRT(DGEensCT, glmfit.DGEensCT)
Could someone help me with this?
Thank you very much in advanced!
Cheers,
Luca
-- output of sessionInfo():
R version 2.12.0 (2010-10-15)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.6.0 edgeR_2.0.5
loaded via a namespace (and not attached):
[1] RCurl_1.4-3 XML_3.2-0 limma_3.6.9
--
Sent via the guest posting facility at bioconductor.org.
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list