[BioC] total count filter cutoff

Ryan C. Thompson rct at thompsonclan.org
Wed Apr 30 22:11:57 CEST 2014


Filtering on raw counts has a statistical motivation, i.e. something 
like "we can't do statistics with fewer than X reads". Filtering on CPM 
is sometimes just used as a proxy for count-based filtering, but 
sometimes it also has a biological motivation, i.e. "we believe that 
CPM < X represents biological noise transcription rather than genuine 
regulated transcription relevant to the biological system in question". 
So you have to consider what your goals are for filtering and choose an 
appropriate method.
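
To make the relationship between the two kinds of threshold concrete, here
is a minimal base-R sketch. The count matrix, the library sizes, and all
variable names (counts, lib.size, cpm.manual, and so on) are invented for
illustration; the CPM is computed by hand as count / (library size in
millions), which is roughly what edgeR's cpm() does when no normalization
factors are applied.

```r
# Toy count matrix: 4 genes (rows) x 3 samples (columns). Values are made up.
counts <- matrix(c( 6,   8,  7,
                   25,  30, 22,
                   90, 110, 95,
                    0,   1,  0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Assumed sequencing depth per sample (total reads); not the column sums
# of the toy matrix, which only holds a handful of genes.
lib.size <- c(s1 = 20e6, s2 = 20e6, s3 = 20e6)

# Counts per million: divide each column by its library size in millions.
cpm.manual <- sweep(counts, 2, lib.size / 1e6, "/")

# Raw count implied by a CPM cutoff at each sample's depth.
cpm.cutoff <- 1
implied.count <- cpm.cutoff * lib.size / 1e6

# "At least X in at least Y samples" filters, on CPM and on raw counts.
min.samples <- 3
count.cutoff <- 5
keep.cpm   <- rowSums(cpm.manual >= cpm.cutoff) >= min.samples
keep.count <- rowSums(counts >= count.cutoff) >= min.samples

print(implied.count)
print(cbind(keep.cpm, keep.count))
```

With these assumed depths, a CPM cutoff of 1 implies 20 reads per sample,
so gene1 passes the raw-count filter (at least 5 reads in all three
samples) but fails the CPM filter. Which of the two behaviours you want
depends on whether your motivation is statistical or biological.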

-Ryan

On Wed 30 Apr 2014 01:04:03 PM PDT, Aaron Mackey wrote:
> this is perhaps obvious to some, but I've seen colleagues surprised by
> it nonetheless: if each sample has been sequenced to a depth of ~20
> million reads, then with cpm >= 1, you're effectively/approximately
> requiring raw counts >= 20; if your depth is 100 million reads, then
> you're requiring counts >= 100 (and presumably the whole reason you
> paid for 100 million reads was to get larger dynamic range at the low
> end, which you've just thrown away).  That "1 cpm rule of thumb" seems
> to be pervasive, and often used without thought to library size and
> dynamic range.  We did want to try to be better than microarrays, right?
>
> So, is there a disadvantage for filtering based on "raw count >= X
> (where X is 5, 10, etc.) in at least Y samples" rather than CPM?  Or
> would you suggest in such cases still normalizing by read depth but
> lowering the threshold (e.g. cpm >= 1/(mean lib. size in millions)).
> I'm assuming non-pathological cases of fairly homogeneous library size
> per sample.
>
> -Aaron
>
>
> On Wed, Apr 30, 2014 at 3:34 PM, Mark Robinson
> <mark.robinson at imls.uzh.ch> wrote:
>
>
>     In my lab, we typically follow a "CPM of at least X in at least Y
>     samples" rule, where X=1 (arbitrary but reasonable, can be
>     changed) and Y=size of smallest replicate group, according to one
>     of the case studies in the user's guide, for example:
>
>     ------
>     4.3.6 Filtering
>     We filter out very lowly expressed tags, keeping genes that are
>     expressed at a reasonable level in at least one treatment
>     condition. Since the smallest group size is three, we keep genes
>     that achieve at least one count per million (cpm) in at least
>     three samples:
>
>     > keep <- rowSums(cpm(y)>1) >= 3
>     > y <- y[keep,]
>     ------
>
>     (http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf)
>
>     Cheers, Mark
>
>
>     ----------
>     Prof. Dr. Mark Robinson
>     Statistical Bioinformatics, Institute of Molecular Life Sciences
>     University of Zurich
>     http://ow.ly/riRea
>
>     On 30.04.2014, at 21:23, "Ryan C. Thompson" <rct at thompsonclan.org> wrote:
>
>     > Dear Mahnaz,
>     >
>     > Total count filtering and mean count filtering are equivalent,
>     > since the only difference is a constant factor (dividing by the
>     > number of samples), so the mean count filter demonstrated in the
>     > genefilter vignette corresponds to your question.
>     >
>     > If you are expecting the vignette to simply give you a specific
>     > number to use as a cutoff, that's not possible, because the
>     > threshold depends on the data. I suggest that you adapt the R code
>     > in this vignette to your data in order to choose an appropriate
>     > cutoff.
>     >
>     > -Ryan
>     >
>     > On Wed 30 Apr 2014 12:04:33 PM PDT, Mahnaz Kiani wrote:
>     >> Thanks for the quick response. I did check that but didn't find any
>     >> information about a total count filter cutoff. Would you please help
>     >> me with that?
>     >>
>     >> Thanks,
>     >> Mahnaz
>     >>
>     >>
>     >> On Wed, Apr 30, 2014 at 1:47 PM, Wolfgang Huber <whuber at embl.de> wrote:
>     >>
>     >>> Dear Mahnaz
>     >>>
>     >>> http://bioconductor.org/packages/release/bioc/html/genefilter.html ->
>     >>> Diagnostics for independent filtering -> Section 4 provides
>     >>> some options.
>     >>>         Wolfgang
>     >>>
>     >>> On 30 Apr 2014, at 20:29, mahnaz Kiani [guest]
>     >>> <guest at bioconductor.org> wrote:
>     >>>
>     >>>>
>     >>>> I'm using edgeR for analysis of my data and I'm not sure what total
>     >>>> count filter cutoff value I should use. My reads are paired 50 bp
>     >>>> reads and total reads per sample is about 80,000,000. I tried cutoff
>     >>>> values of 5, 10, 15, 30, 50 and 100 and I only saw differences
>     >>>> between 50 and 100, but I am still looking for a logical reason to
>     >>>> choose the cutoff value.
>     >>>>
>     >>>> Appreciate your help,
>     >>>> Mahnaz
>     >>>>
>     >>>> -- output of sessionInfo():
>     >>>>
>     >>>> R 3.0.2
>     >>>>
>     >>>> --
>     >>>> Sent via the guest posting facility at bioconductor.org.
>     >>>>
>     >>>> _______________________________________________
>     >>>> Bioconductor mailing list
>     >>>> Bioconductor at r-project.org
>     >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>     >>>> Search the archives:
>     >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>     >>>
>     >>>
>     >>
>


