[BioC] Too many DEG with edgeR output?
Mark Robinson
mark.robinson at imls.uzh.ch
Tue Oct 22 08:38:01 CEST 2013
Hi Bernardo,
A couple quick notes.
1. Filtering
> keep <- rowSums(cpm(RG)>1) >= 2 #we keep genes that achieve at least one
> count per million (cpm) in at least three samples
Despite what the comment says, this keeps only features where CPM=1 or higher in BOTH of your samples. But, maybe you'd be interested in features that are sufficiently high in 1 condition and more or less absent in the other ? Using your filter, these are removed.
2. Statistics
> bcv <- 0.2 #Assigned dispersion value of 0.2
I'm assuming this is just a random guess as to what the real dispersion is.
One suggestion in the User's guide is this: "Be satised with a descriptive analysis, that might include an MDS plot and an analysis of fold changes. Do not attempt a significance analysis. This may be the best advice."
(2.9 What to do if you have no replicates; http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf)
If you feel that you have "too many" DE genes (I'm not sure exactly what you mean and/or how you know this), you could increase the dispersion, but overall, I wouldn't read too much into the statistics (i.e. P-values), if you don't have replicates.
Best, Mark
On 21.10.2013, at 11:23, Bernardo Bello <popnard at gmail.com> wrote:
> Hi there,
>
> I'm dealing with bacterial RNA-seq analysis. My experiment is very simple.
> Two samples to compare and no replicates. Reads were generated with Ion
> Torrent PGM using 316 chip. One for each sample and performed in different
> days.
>
> Since I had too many differentially expressed genes, ¿Should I be more
> conservative assigning edgeR dispersion value? Also, there are considerable
> more up-regulated genes in *exvivo* than in plate sample.
>
> See logFC_vs_logCPM figure:
>
> https://docs.google.com/file/d/0B8-ZAuZe8jldY0Y5SWJSdXh1WGc/edit?usp=sharing
>
> Thanks for you help, Bernardo
>
>
>
> - tmap code:
>
> tmap map2 -f HPNK_clean.fsa -r exvivo.fastq -i fastq -s exvivo.sam --verbose
>
> tmap map2 -f HPNK_clean.fsa -r plate.fastq -i fastq -s plate.sam --verbose
>
> - Flasgstat:
>
> exvivo:
>
>> 3240242 + 0 in total (QC-passed reads + QC-failed reads)
>
>
>> 2132481 + 0 mapped (65.81%:nan%)
>
> plate:
>
>> 3774075 + 0 in total (QC-passed reads + QC-failed reads)
>
>
>> 3510438 + 0 mapped (93.01%:nan%)
>
>
> - count:
>
> python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID
> exvivo.sam HPNK.gff > exvivo.counts
>
> python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID
> plate.sam HPNK.gff > plate.counts
>
>
> - count stats:
>
> ex-vivo stats
>
>> no_feature 777946
>
>> ambiguous 1
>
>> too_low_aQual 0
>
>> not_aligned 1107761
>
>> alignment_not_unique 0
>
> plate stats
>
>> no_feature 776707
>
>> ambiguous 47
>
>> too_low_aQual 0
>
>> not_aligned 263637
>
>> alignment_not_unique 0
>
> - edgeR code:
>
> library(edgeR)
>
> files <- dir(pattern="*\\.counts$")
>
> RG <- readDGE(files, header=FALSE)
>
> RG
>
> keep <- rowSums(cpm(RG)>1) >= 2 #we keep genes that achieve at least one
> count per million (cpm) in at least three samples
>
> RG <- RG[keep,]
>
> dim(RG)
>
>
> RG <- calcNormFactors(RG)
>
> RG$samples
>
>
> plotMDS(RG)
>
>
> bcv <- 0.2 #Assigned dispersion value of 0.2
>
> m <- as.matrix(RG)
>
> d <- DGEList(counts=m, group=(1:2)) #modify 'group' depending on sample
> number. Also can be adapted to replicated samples, see'?DGEList'.
>
> d
>
> et <- exactTest(d, pair=(1:2),dispersion=bcv^2) #exactTest(RG,
> pair=(1:2),dispersion=bcv^2)
>
> et
>
> top <- topTags(et)
>
> top
>
> cpm(RG)[rownames(top), ] #Check the individual cpm values for the top genes:
>
> summary(de <- decideTestsDGE(et)) #The total number of DE genes at 5% FDR
> is given by'decideTestsDGE'.
>
>
> [,1]
>
> -1 200
>
> 0 1176
>
> 1 769
>
> Of the 'number' tags identified as DE, 769 are up-regulated ex-vivo and 200
> are down-regulated.
>
> detags <- rownames(RG)[as.logical(de)] #detags contains the DE genes at 5%
> FDR
>
> plotSmear(et, de.tags=detags) #plot all genes and highlight DE genes at 5%
> FDR
>
> abline(h=c(-1, 1), col="blue") #The blue lines indicate 2-fold changes.
>
> title("plate vs ex-vivo")
>
> dev.copy2pdf(file = "Figure_1.pdf") #Save as .pdf##
>
> --
>
> *Bernardo Bello Ortí*
>
> PhD student
>
> CReSA-IRTA
>
> Campus de Bellaterra-Universitat Autònoma de Barcelona
>
> Edifici CReSA
>
> 08193 Bellaterra (Barcelona, Spain)
>
> Tel.: 647 42 52 63 *www.cresa.es *
>
> *
> *
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
----------
Prof. Dr. Mark Robinson
Bioinformatics, Institute of Molecular Life Sciences
University of Zurich
http://tiny.cc/mrobin
More information about the Bioconductor
mailing list