[BioC] edgeR cpm filtering
John [guest]
guest at bioconductor.org
Mon Feb 11 17:54:54 CET 2013
All,
I am a new edgeR user. I have difficulty understanding the meaning of the "cpm" function in the edgeR package. I understand that each count is divided by the total library size and then multiplied by 1,000,000. But why 1M? I don't understand the logic behind using 1M. Is it 1M reads? Or bases? And why not 10M, or 1000? Is there any specific reason for using 1M?
Another issue I have: how can I filter out genes that have 0 reads in one group of samples but a very large number of reads in the other groups? Here is an example:
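For reference, the calculation described above can be sketched in base R without calling edgeR itself. The counts and library sizes below are made-up numbers, and `cpm_manual` is a hypothetical name used only for illustration. The sketch also shows that a different multiplier would only rescale the values, so 1,000,000 acts as a choice of unit (counts per million reads in the library) rather than changing the ranking of genes.

```r
# Made-up raw counts: 2 genes x 3 samples (illustration only).
counts <- matrix(c(150,   0,
                   100,  50,
                   270,  30),
                 nrow = 2,
                 dimnames = list(c("Gene_X", "Gene_Y"), c("S1", "S2", "S3")))

# Assumed total mapped reads per sample (hypothetical library sizes).
lib_size <- c(S1 = 2e6, S2 = 1e6, S3 = 3e6)

# Counts per million: divide each column by its library size, then scale by 1e6.
cpm_manual <- t(t(counts) / lib_size) * 1e6

# The same data expressed per thousand reads: identical up to a constant factor.
cpk_manual <- t(t(counts) / lib_size) * 1e3

print(cpm_manual)
```

With these assumed numbers, 150 reads in a library of 2 million reads corresponds to 75 counts per million.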
Sample                   Gene_X
Sample 1, replicate 1    150
Sample 1, replicate 2    100
Sample 2, replicate 1    270
Sample 2, replicate 2    320
Sample 3, replicate 1    0
Sample 3, replicate 2    0
I used:
d_DGEList <- d_DGEList[rowSums(cpm_filtered > 5) > 2,]
But Gene_X is still not filtered out. Many genes with a low number of reads are removed, but a few like Gene_X remain. I think that having many reads in samples 1 and 2 is what lets it pass the cpm filter. How can I filter out genes like this? Is it OK if I manually delete cases like this?
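A minimal base-R sketch of why the filter above keeps Gene_X, assuming (hypothetically) that every library has exactly 1 million reads so that cpm equals the raw count. The rule `rowSums(cpm > 5) > 2` only counts how many samples exceed the threshold, regardless of group, and four of the six samples do. The stricter rule at the end (`keep_strict`) is a hypothetical illustration of a filter that would drop Gene_X; whether removing such genes is appropriate is a separate question that this sketch does not settle.

```r
# Gene_X counts from the example above; sample names are illustrative.
gene_x <- matrix(c(150, 100, 270, 320, 0, 0), nrow = 1,
                 dimnames = list("Gene_X",
                                 c("S1_r1", "S1_r2", "S2_r1", "S2_r2", "S3_r1", "S3_r2")))

# Assumption: every library has 1e6 reads, so cpm equals the raw count.
lib_size <- rep(1e6, ncol(gene_x))
cpm_x <- t(t(gene_x) / lib_size) * 1e6

# Number of samples in which the gene exceeds 5 cpm.
n_pass <- rowSums(cpm_x > 5)

# The filter from the post: keep genes above 5 cpm in more than 2 samples.
keep_current <- n_pass > 2

# Hypothetical stricter rule: require more than 5 cpm in every sample.
keep_strict <- n_pass == ncol(cpm_x)

print(n_pass)
print(keep_current)
print(keep_strict)
```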
Thank you.
John
-- output of sessionInfo():
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_2.6.0 limma_3.12.0
>
--
Sent via the guest posting facility at bioconductor.org.