[BioC] Too many DEG with edgeR output?

Tue Oct 22 08:38:01 CEST 2013

Hi Bernardo,

A couple quick notes.

1. Filtering

> keep <- rowSums(cpm(RG)>1) >= 2 #we keep genes that achieve at least one
> count per million (cpm) in at least three samples

Despite what the comment says, this keeps only features where CPM=1 or higher in BOTH of your samples.  But, maybe you'd be interested in features that are sufficiently high in 1 condition and more or less absent in the other ?  Using your filter, these are removed.

2. Statistics

> bcv <- 0.2 #Assigned dispersion value of 0.2

I'm assuming this is just a random guess as to what the real dispersion is.  

One suggestion in the User's guide is this: "Be satised with a descriptive analysis, that might include an MDS plot and an analysis of fold changes. Do not attempt a significance analysis. This may be the best advice."

(2.9 What to do if you have no replicates; http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf)

If you feel that you have "too many" DE genes (I'm not sure exactly what you mean and/or how you know this), you could increase the dispersion, but overall, I wouldn't read too much into the statistics (i.e. P-values), if you don't have replicates.

Best, Mark

On 21.10.2013, at 11:23, Bernardo Bello <popnard at gmail.com> wrote:

> Hi there,
> 
> I'm dealing with bacterial RNA-seq analysis. My experiment is very simple.
> Two samples to compare and no replicates. Reads were generated with Ion
> Torrent PGM using 316 chip. One for each sample and performed in different
> days.
> 
> Since I had too many differentially expressed genes, ¿Should I be more
> conservative assigning edgeR dispersion value? Also, there are considerable
> more up-regulated genes in *exvivo* than in plate sample.
> 
> See logFC_vs_logCPM figure:
> 
> https://docs.google.com/file/d/0B8-ZAuZe8jldY0Y5SWJSdXh1WGc/edit?usp=sharing
> 
> Thanks for you help, Bernardo
> 
> 
> 
> - tmap code:
> 
> tmap map2 -f HPNK_clean.fsa -r exvivo.fastq -i fastq -s exvivo.sam --verbose
> 
> tmap map2 -f HPNK_clean.fsa -r plate.fastq -i fastq -s plate.sam --verbose
> 
> - Flasgstat:
> 
> exvivo:
> 
>> 3240242 + 0 in total (QC-passed reads + QC-failed reads)
> 
> 
>> 2132481 + 0 mapped (65.81%:nan%)
> 
> plate:
> 
>> 3774075 + 0 in total (QC-passed reads + QC-failed reads)
> 
> 
>> 3510438 + 0 mapped (93.01%:nan%)
> 
> 
> - count:
> 
>    python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID
> exvivo.sam HPNK.gff > exvivo.counts
> 
>    python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID
> plate.sam HPNK.gff > plate.counts
> 
> 
> - count stats:
> 
> ex-vivo stats
> 
>> no_feature 777946
> 
>> ambiguous 1
> 
>> too_low_aQual 0
> 
>> not_aligned 1107761
> 
>> alignment_not_unique 0
> 
> plate stats
> 
>> no_feature 776707
> 
>> ambiguous 47
> 
>> too_low_aQual 0
> 
>> not_aligned 263637
> 
>> alignment_not_unique 0
> 
> - edgeR code:
> 
> library(edgeR)
> 
> files <- dir(pattern="*\\.counts$")
> 
> RG <- readDGE(files, header=FALSE)
> 
> RG
> 
> keep <- rowSums(cpm(RG)>1) >= 2 #we keep genes that achieve at least one
> count per million (cpm) in at least three samples
> 
> RG <- RG[keep,]
> 
> dim(RG)
> 
> 
> RG <- calcNormFactors(RG)
> 
> RG$samples
> 
> 
> plotMDS(RG)
> 
> 
> bcv <- 0.2 #Assigned dispersion value of 0.2
> 
> m <- as.matrix(RG)
> 
> d <- DGEList(counts=m, group=(1:2)) #modify 'group' depending on sample
> number. Also can be adapted to replicated samples, see'?DGEList'.
> 
> d
> 
> et <- exactTest(d, pair=(1:2),dispersion=bcv^2) #exactTest(RG,
> pair=(1:2),dispersion=bcv^2)
> 
> et
> 
> top <- topTags(et)
> 
> top
> 
> cpm(RG)[rownames(top), ] #Check the individual cpm values for the top genes:
> 
> summary(de <- decideTestsDGE(et)) #The total number of DE genes at 5% FDR
> is given by'decideTestsDGE'.
> 
> 
>   [,1]
> 
> -1  200
> 
> 0  1176
> 
> 1   769
> 
> Of the 'number' tags identified as DE, 769 are up-regulated ex-vivo and 200
> are down-regulated.
> 
> detags <- rownames(RG)[as.logical(de)] #detags contains the DE genes at 5%
> FDR
> 
> plotSmear(et, de.tags=detags) #plot all genes and highlight DE genes at 5%
> FDR
> 
> abline(h=c(-1, 1), col="blue") #The blue lines indicate 2-fold changes.
> 
> title("plate vs ex-vivo")
> 
> dev.copy2pdf(file = "Figure_1.pdf") #Save as .pdf##
> 
> -- 
> 
> *Bernardo Bello Ortí*
> 
> PhD student
> 
> CReSA-IRTA
> 
> Campus de Bellaterra-Universitat Autònoma de Barcelona
> 
> Edifici CReSA
> 
> 08193  Bellaterra (Barcelona, Spain)
> 
> Tel.: 647 42 52 63 *www.cresa.es  *
> 
> *
> *
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

----------
Prof. Dr. Mark Robinson
Bioinformatics, Institute of Molecular Life Sciences
University of Zurich
http://tiny.cc/mrobin