[BioC] edgeR on differentially expressed genes with low read counts

Naomi Altman naomi at stat.psu.edu
Tue Nov 29 18:33:34 CET 2011


You should filter out the genes with extremely 
low total counts as you will not have enough power to achieve significance.

Naomi

At 08:54 AM 11/29/2011, Lucía [guest] wrote:

>Hi,
>I am using the package edgeR to determine 
>differentially expressed genes in RNA-seq data. 
>I have 8 samples, corresponding to 3 patients (2 
>conditions per patient and 2 patients 
>duplicated). I performed the contrast of these 
>two conditions, following the user guide 
>(section 11 Case study: Oral carcinomas vs 
>matched normal tissue). When analyzing some 
>border line differentially expressed genes (FDR 
>~ 0.02) I found that in some cases, the read 
>counts was really low, e.g. only one sample with 
>2 reads, and the others (7) with 0 counts.
> > 
> countsTable[rownames(countsTable)=="ENSG00000207696",grep("CT",colnames(countsTable))]
>                 61_CT.poli 61_CT.tot 67_CT.poli 
> 67_CT.tot 70_CT.poli 70_CT.tot 61m2_CT.tot 67m2_CT.tot
>ENSG00000207696          0         0          0 
>        0          2         0           0           0
>
>The design matrix used was:
> >       design
>   patientsCT61 patientsCT67 patientsCT70 
> as.factor(1 - (as.numeric(groupsCT) - 1))1
>1            1            0            0 
>                                  0
>2            0            1            0 
>                                  0
>3            0            0            1 
>                                  0
>4            1            0            0 
>                                  0
>5            0            1            0 
>                                  0
>6            1            0            0 
>                                  1
>7            0            1            0 
>                                  1
>8            0            0            1 
>                                  1
>
>The result for the gene in question was:
> > tab.DGEensCT[rownames(tab.DGEensCT)=="ENSG00000207696",]
>                 table.logConc table.logFC 
> table.PValue table.FDR adjust.method                                 comparison
>ENSG00000207696        -18.34       7.726 
>0.005222   0.03941            BH as.factor(1 - (as.numeric(groupsCT) - 1))1
>
>When investigating the result of the gene, I 
>found that the glm-fitted values looked like this:
>
> > 
> format((glmfit.DGEensCT$fitted.values)[rownames(glmfit.DGEensCT$fitted.values)=="ENSG00000207696",],scientific=FALSE,digits=4)
>    61_CT.tot    67_CT.tot    70_CT.tot 
> 61m2_CT.tot  67m2_CT.tot   61_CT.poli   67_CT.poli   70_CT.poli
>"0.00008988" "0.00004048" "0.06229493" 
>"0.00026943" "0.00014959" "0.03479503" "0.03522840" "1.84247667"
>
>Since I was not able to find the details used in 
>edgeR to calculate the model easily, I was 
>wondering if from this glm-fitted values is it 
>reasonable to obtain such a low FDR?
>I used the common dispersion for the variance estimate:
>DGEensCT.CR<-estimateCRDisp(DGEensCT,design)
>glmfit.DGEensCT <- glmFit(DGEensCT, 
>design,dispersion = DGEensCT.CR$CR.common.dispersion)
>lrt.DGEensCT <- glmLRT(DGEensCT, glmfit.DGEensCT)
>
>Could someone help me with this?
>Thank you very much in advanced!
>
>Cheers,
>Lucía
>
>  -- output of sessionInfo():
>
>R version 2.12.0 (2010-10-15)
>Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
>locale:
>[1] C
>
>attached base packages:
>[1] stats     graphics  grDevices utils     datasets  methods   base
>
>other attached packages:
>[1] biomaRt_2.6.0 edgeR_2.0.5
>
>loaded via a namespace (and not attached):
>[1] RCurl_1.4-3 XML_3.2-0   limma_3.6.9
>
>--
>Sent via the guest posting facility at bioconductor.org.
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at r-project.org
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives: 
>http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list