[BioC] edgeR cpm filtering
James W. MacDonald
jmacdon at uw.edu
Mon Feb 11 18:52:13 CET 2013
Hi John,
On 2/11/2013 11:54 AM, John [guest] wrote:
> All,
>
> I am a new edgeR user. I have difficulty understanding the meaning of the ‘cpm’ function of edgeR package. I mean I understand that each value is divided by the total library value, and then multiplied by 1,000,000. But why 1M? I don’t understand what is the logic behind using 1M? is it 1M reads? Or bases? And why not 10M? or 1000? Any specific reason for using 1M?
It's reads. You are inputting read counts per gene, so by definition
edgeR knows nothing about bases. As for why 1M, I don't know for sure,
but would imagine it is because a 'reasonable' sized RNA-Seq experiment
is based on somewhere around 10M reads or so.
In other words, say you have a sample with 10M reads, and one gene has
60 reads that align. You would have 6 cpm, which looks pretty low, and
it is. However you could do statistics on it based on a discrete
distribution like the negative binomial.
If you used 10M to normalize, the cp10M would be 0.6, so you would have
to throw that gene out. If you used cpk, it would be 6000 cpk and it
would look like you had lots of reads when in fact you only had 60.
So cpm looks like a reasonable compromise to me.
>
> Another issues that I have is that how can I enforce filtering the samples that have 0 reads in one group of samples, but very large number of reads in another groups? Here is an example:
>
> Samples, Sample 1-replicate 1, Sample 1-replicate 2, Sample 2-replicate 1, Sample 2- replicate 2, Sample 3-replicate 1, Sample 3- replicate 2
> Gene_X, 150,100, 270,320,0,0
>
> I used:
>
> d_DGEList<- d_DGEList[rowSums(cpm_filtered> 5)> 2,]
>
> But still Gene_X is not filtered. Many genes with low number of reads are filtered, but very few like Gene_X are still there. I think that having many reads mapped to samples 1 and 2 qualifies it for passing the cpm filtering. How can I filter genes like this? Is it OK if I manually delete cases like this?
Setting aside the logic of removing these genes (which IMO is missing
the point of looking for differential expression), note the logic of
your filter. Let's break it down by section:
rowSums(cpm_filtered > 5)
This gives the count of samples where the cpm value is > 5. In the case
you mention, this would be four.
rowSums(cpm_filtered > 5) > 2
This results in TRUE if the count of samples for a given gene with a cpm
value > 5 is greater than two. So you are saying that at least two
samples have to have a cpm > 5. In the instance you want to filter, the
count is 4, and 4 > 2, so this passes the filter.
So what you apparently want are rows where the cpm is greater than some
value in ALL samples, not just two of them, so you would want to change
the > 2 part to == 6.
Note that this doesn't really make any sense. You are in effect saying
that you are completely uninterested in any genes that appear not to be
expressed in one of your samples, but that might be highly expressed in
one or more of the other samples. That to me is actually really
interesting, and I am not sure why you would want to filter out any gene
that fulfills that criterion.
Best,
Jim
>
> Thank you.
> John
>
>
> -- output of sessionInfo():
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
>
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
>
> locale:
>
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
>
> attached base packages:
>
> [1] stats graphics grDevices utils datasets methods base
>
>
> other attached packages:
>
> [1] edgeR_2.6.0 limma_3.12.0
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list