[BioC] Expected number of DE genes?
Jessica Perry Hekman
hekman2 at illinois.edu
Wed Jul 16 02:11:52 CEST 2014
On 07/15/2014 05:32 PM, Steve Lianoglou wrote:
> We can only help you to answer the last point, but to do so we will
> need to see the code you used and an explanation of your dataset (ie.
> which samples are which condition).
>
> Also: why do you expect to get upwards of a thousand differentially
> expressed genes? Have you (or others) done this experiment before and
> verified those results, or?
Steve -- thanks for your feedback. This is my first time doing RNA-Seq
and I had been under the impression that there would be a large number
of DE genes just by chance in most experiments. I think to a large
extent I just need to adjust my expectations. I just don't have a feel
for what normal results look like!
However, as you asked whether this had been done before, I sheepishly
recalled that we do have some analysis from a different lab using
different tissues from the same animals and a slightly different
approach to dealing with the data (de novo transcriptome instead of
aligning to dog). It turns out that they found 350-650 DE genes,
depending on the tissue. So they did get more DE genes than I did.
> What does the enrichment of reads that align to exons vs. intergenic
> reads look like?
I haven't done this analysis, but I'll take a look and see what it looks
like. I'm guessing that I should expect 80-90% of reads to align to
exons, since this is RNA?
You suggested that seeing my code might be helpful, so here it is, in
case you're curious:
For EdgeR:
x <- read.delim("cf3ncbi.genes.matrix", row.names="GeneName")
group <- factor(c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2))
y <- DGEList(counts=x,group=group)
y <- calcNormFactors(y)
y <- estimateCommonDisp(y)
y <- estimateTagwiseDisp(y)
et <- exactTest(y)
tt <- topTags(et, n=100)
write.table(tt, file = "edgeR-cf3ncbi-topTags100.txt")
...where cf3ncbi.genes.matrix looks like:
GeneName 1_481 2_124 3_23 4_124 5_125 6_126 7_127 8_128 9_129
10_130 11_131 12_132 49_169 50_170 51_171 52_172 53_173 54_174 55_535
56_176 57_177 58_178 59_179 60_180
A1BG 27 39 55 41 50 42 33 66 63 31 50 34 52 53 34 44 35 20 45 43 75 47 43 45
A1CF 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
and so forth (as suggested in the code, the first 12 samples are from
one treatment group and the second 12 samples are from another).
For DESeq2:
sample.table <- data.frame(read.table("DESeq2SampleTable.txt", header=TRUE))
dds <- DESeqDataSetFromHTSeqCount(sample.table, directory = "cf3ncbi/",
formula(~ type))
dds2 <- DESeq(dds)
res <- results(dds2)
write.table (res, "cf3ncbi.deseq2.results.csv")
sum( res$padj < 0.05, na.rm=TRUE)
...where DESeq2SampleTable.txt looks like:
sampleName sampleFile type
1_481 cf3ncbi-1_481.genes.counts tame
2_124 cf3ncbi-2_124.genes.counts tame
...
Thanks,
Jessica
More information about the Bioconductor
mailing list