[BioC] Expected number of DE genes?

Wed Jul 16 02:11:52 CEST 2014

On 07/15/2014 05:32 PM, Steve Lianoglou wrote:

> We can only help you to answer the last point, but to do so we will
> need to see the code you used and an explanation of your dataset (ie.
> which samples are which condition).
>
> Also: why do you expect to get upwards of a thousand differentially
> expressed genes? Have you (or others) done this experiment before and
> verified those results, or?

Steve -- thanks for your feedback. This is my first time doing RNA-Seq 
and I had been under the impression that there would be a large number 
of DE genes just by chance in most experiments. I think to a large 
extent I just need to adjust my expectations. I just don't have a feel 
for what normal results look like!

However, as you asked whether this had been done before, I sheepishly 
recalled that we do have some analysis from a different lab using 
different tissues from the same animals and a slightly different 
approach to dealing with the data (de novo transcriptome instead of 
aligning to dog). It turns out that they found 350-650 DE genes, 
depending on the tissue. So they did get more DE genes than I did.

> What does the enrichment of reads that align to exons vs. intergenic
> reads look like?

I haven't done this analysis, but I'll take a look and see what it looks 
like. I'm guessing that I should expect 80-90% of reads to align to 
exons, since this is RNA?

You suggested that seeing my code might be helpful, so here it is, in 
case you're curious:

For EdgeR:

x <- read.delim("cf3ncbi.genes.matrix", row.names="GeneName")
group <- factor(c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2))
y <- DGEList(counts=x,group=group)
y <- calcNormFactors(y)
y <- estimateCommonDisp(y)
y <- estimateTagwiseDisp(y)
et <- exactTest(y)
tt <- topTags(et, n=100)
write.table(tt, file = "edgeR-cf3ncbi-topTags100.txt")

...where cf3ncbi.genes.matrix looks like:

GeneName	1_481	2_124	3_23	4_124	5_125	6_126	7_127	8_128	9_129 
10_130	11_131	12_132	49_169	50_170	51_171	52_172	53_173	54_174	55_535 
56_176	57_177	58_178	59_179	60_180
A1BG	27	39	55	41	50	42	33	66	63	31	50	34	52	53	34	44	35	20	45	43	75	47	43	45
A1CF	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

and so forth (as suggested in the code, the first 12 samples are from 
one treatment group and the second 12 samples are from another).

For DESeq2:

sample.table <- data.frame(read.table("DESeq2SampleTable.txt", header=TRUE))
dds <- DESeqDataSetFromHTSeqCount(sample.table, directory = "cf3ncbi/", 
formula(~ type))
dds2 <- DESeq(dds)
res <- results(dds2)
write.table (res, "cf3ncbi.deseq2.results.csv")
sum( res$padj < 0.05, na.rm=TRUE)

...where DESeq2SampleTable.txt looks like:

sampleName      sampleFile      type
1_481      cf3ncbi-1_481.genes.counts tame
2_124      cf3ncbi-2_124.genes.counts tame
...

Thanks,
Jessica