[BioC] Filtering out tags with low counts in DESeq and EgdeR?
Xiaohui Wu
wux3 at muohio.edu
Sat May 21 10:25:08 CEST 2011
Hello,
I want to find DE genes between two conditions, using DESeq, EdgeR or other tools if any.
I have some questions on how/when to filter out genes with low counts and I am also confused on the final results.
I have 50,000 genes, 2 conditions, each condition 4 replicates. 25,000 genes have >=8 counts in at least one condition.
The total counts in each replicate is as following:
cond1_rep1 cond1_rep2 cond1_rep3 cond1_rep4; cond2_rep1 cond2_rep2 cond2_rep3 cond2_rep4
1792310 1896603 1804500 2166961; 1792663 2492385 1781523 2222892
1. For DESeq
I used all 50,000 genes to estimate size factor for normalization.
Then should I remove genes with low count before estimating variance or after that?
I got totally different results using the following two ways, which I was really surprised that NO overlap was found between the two DE gene sets.
1) no filtering before estimating variance
d <- newCountDataSet( dat, group )
d <- estimateSizeFactors( d)
sf=sizeFactors(d)
d <- estimateVarianceFunctions( d )
res <- nbinomTest( d,"cond1","cond2")
resSig <- res[ res$padj < .1, ]
2) filtering before estimating variance
filter=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## remove low count genes
filter.ds <- newCountDataSet( filter, group )
filter.ds$sizeFactor=sf ## use the sizefactor from 1)
filter.ds <- estimateVarianceFunctions( filter.ds )
filter.res <- nbinomTest( filter.ds,"cond1","cond2")
filter.sig=filter.res[ filter.res$padj < .1, ]
nrow(filter.sig)
Results: For my data, total 177 DE genes from 2) and 106 from 1), and no overlap between 1) and 2).
2. For EdgeR
1) when should I remove low count genes?
f <- calcNormFactors(dat) ## I should calculate factor using the whole data, right?
f <- f/exp(mean(log(f)))
dat=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## And, then filter out genes with low counts?
d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f)
2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after adjusting p.value, is this possible? Then, can I used the *unadjusted* p.value to get DE genes?
To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05)
Any help or suggestions will be appreciated.
Thanks,
Xiaohui
More information about the Bioconductor
mailing list