[BioC] edgeR on differentially expressed genes with low read counts

Tue Nov 29 14:54:51 CET 2011

Hi, 
I am using the package edgeR to determine differentially expressed genes in RNA-seq data. I have 8 samples, corresponding to 3 patients (2 conditions per patient and 2 patients duplicated). I performed the contrast of these two conditions, following the user guide (section 11 Case study: Oral carcinomas vs matched normal tissue). When analyzing some border line differentially expressed genes (FDR ~ 0.02) I found that in some cases, the read counts was really low, e.g. only one sample with 2 reads, and the others (7) with 0 counts.
> countsTable[rownames(countsTable)=="ENSG00000207696",grep("CT",colnames(countsTable))]
                61_CT.poli 61_CT.tot 67_CT.poli 67_CT.tot 70_CT.poli 70_CT.tot 61m2_CT.tot 67m2_CT.tot
ENSG00000207696          0         0          0         0          2         0           0           0

The design matrix used was:
>  	design
  patientsCT61 patientsCT67 patientsCT70 as.factor(1 - (as.numeric(groupsCT) - 1))1
1            1            0            0                                          0
2            0            1            0                                          0
3            0            0            1                                          0
4            1            0            0                                          0
5            0            1            0                                          0
6            1            0            0                                          1
7            0            1            0                                          1
8            0            0            1                                          1

The result for the gene in question was:
> tab.DGEensCT[rownames(tab.DGEensCT)=="ENSG00000207696",]
                table.logConc table.logFC table.PValue table.FDR adjust.method                                 comparison
ENSG00000207696        -18.34       7.726     0.005222   0.03941            BH as.factor(1 - (as.numeric(groupsCT) - 1))1

When investigating the result of the gene, I found that the glm-fitted values looked like this:

> format((glmfit.DGEensCT$fitted.values)[rownames(glmfit.DGEensCT$fitted.values)=="ENSG00000207696",],scientific=FALSE,digits=4)
   61_CT.tot    67_CT.tot    70_CT.tot  61m2_CT.tot  67m2_CT.tot   61_CT.poli   67_CT.poli   70_CT.poli 
"0.00008988" "0.00004048" "0.06229493" "0.00026943" "0.00014959" "0.03479503" "0.03522840" "1.84247667"

Since I was not able to find the details used in edgeR to calculate the model easily, I was wondering if from this glm-fitted values is it reasonable to obtain such a low FDR?
I used the common dispersion for the variance estimate:
DGEensCT.CR<-estimateCRDisp(DGEensCT,design)
glmfit.DGEensCT <- glmFit(DGEensCT, design,dispersion = DGEensCT.CR$CR.common.dispersion)
lrt.DGEensCT <- glmLRT(DGEensCT, glmfit.DGEensCT)

Could someone help me with this?
Thank you very much in advanced!

Cheers, 
LucÃa

 -- output of sessionInfo(): 

R version 2.12.0 (2010-10-15)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.6.0 edgeR_2.0.5  

loaded via a namespace (and not attached):
[1] RCurl_1.4-3 XML_3.2-0   limma_3.6.9

--
Sent via the guest posting facility at bioconductor.org.