[BioC] edgeR cpm filtering
James W. MacDonald
jmacdon at uw.edu
Mon Feb 11 23:10:05 CET 2013
Hi John,
Please don't take things off-list. Even if you are not a subscriber (and
if you are using BioC stuff you should be, and you can always stop
delivery but remain a subscriber), I believe that replying to an
existing thread will work.
I don't see any zero counts causing a problem. Using the example for
cpm() as a starting point, and modifying to have a set with zero counts,
I get this:
> y
     [,1] [,2] [,3] [,4]
[1,]    1    2   14   11
[2,]   11   25    1   26
[3,]    1   22    2   19
[4,]    5    6   15    6
[5,]    0    0    1    5
> d <- DGEList(counts=y, lib.size=1001:1004, group=factor(c(1,1,2,2)))
> d <- estimateCommonDisp(d)
> d <- estimateTagwiseDisp(d)
> topTags(exactTest(d))
Comparison of groups: 2-1
       logFC   logCPM       PValue          FDR
1  2.9550376 12.76964 6.109348e-05 0.0003054674
5  4.6421574 10.54712 1.283343e-01 0.3208358043
4  0.9149142 12.96222 2.668415e-01 0.4447357815
2 -0.4149407 13.93933 8.539261e-01 0.9783799675
3 -0.1325391 13.42121 9.783800e-01 0.9783799675
So the gene with zero counts in group 1 (row 5) is the second row in the
topTags() output, and edgeR has no problem computing a logFC value for it.
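
To make the point concrete, here is a rough base R sketch of why a zero
in one group need not give an infinite logFC. It is not edgeR's actual
computation, only an illustration of the idea. It uses the counts for
gene 5 and the library sizes from the example above. The offset of 0.125
and all of the variable names are assumptions made up for this sketch;
edgeR handles this internally with a small prior count (see the
prior.count argument of exactTest()).

```r
# Counts for gene 5 above: zero in group 1, nonzero in group 2.
counts   <- c(0, 0, 1, 5)
lib_size <- 1001:1004
group    <- c(1, 1, 2, 2)

# Counts per million for each library.
cpm_raw <- counts / lib_size * 1e6

# Naive log2 fold change (group 2 vs group 1): the group 1 mean is 0,
# so the ratio is a division by zero and the result is infinite.
logfc_naive <- log2(mean(cpm_raw[group == 2]) / mean(cpm_raw[group == 1]))

# Same calculation after adding a small offset to every count.
# The value 0.125 is an assumption chosen for illustration only.
offset     <- 0.125
cpm_offset <- (counts + offset) / lib_size * 1e6
logfc_offset <- log2(mean(cpm_offset[group == 2]) /
                     mean(cpm_offset[group == 1]))

print(logfc_naive)   # Inf
print(logfc_offset)  # a large but finite value
```

So there is no need to filter out genes just because one group has zero
counts; the fold change is large, but it is a finite number.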
Best,
Jim
On 2/11/2013 4:30 PM, John Sperry wrote:
> Hi again Jim,
>
> One more thing, in microarray days, people used to add a small value,
> let's say 1, to the 0 values to avoid nonsense fold changes. It's not
> the case in NGS any more, right? So it's not possible to do that in
> edgeR, right? That's why I was thinking about filtering out with cpm.
>
> Thanks,
> John
>
>
>
> ------------------------------------------------------------------------
> *From:* John Sperry <johnsperry51 at yahoo.com>
> *To:* "jmacdon at uw.edu" <jmacdon at uw.edu>
> *Sent:* Monday, February 11, 2013 1:47 PM
> *Subject:* [BioC] edgeR cpm filtering
>
> Hi Jim,
>
> I'm very new to edgeR and BioC. I couldn't reply to your post in BioC,
> so here is my post in an email :D
>
> I still cannot see why 1M is chosen, but I appreciate your explanations.
>
> About the cpm filtering, the reason that I chose '> 2' for 3 samples
> with each having 2 replicates was that I thought edgeR must be smart
> enough to figure out that when I say more than 5 reads per million for
> more than 2 samples, it means for ALL the replicates of each sample!
> Which apparently is not the case! Thanks for pointing that out!
>
> As for the reason for wanting to get rid of the sample 3 with 2
> replicates that have 0 reads mapped to them: I don't want them
> because they cause the logFC to become huge nonsense numbers! I
> guess dividing by 0 causes the problem! So I thought that, to avoid
> seeing weird values when the significant genes are selected, it's
> better to get rid of genes that have 0 reads mapped to any of their
> groups. Does it make sense?
>
> d_DGEList <- d_DGEList[rowSums(cpm_filtered > 5) > 2, ]
>
> Thanks,
> John
>
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099