[BioC] total count filter cutoff (edgeR)

Gordon K Smyth smyth at wehi.EDU.AU
Fri May 2 01:11:29 CEST 2014


Hi Mahnaz,

Why don't you follow the advice of the edgeR User's Guide (as Mark has 
suggested)?  All the case studies in the User's Guide describe how the 
filtering was done in a principled way.

Total count filtering is not so bad, but it is susceptible to being driven 
by one library, especially by one library with a large sequence depth. 
The procedure described by Mark and used in the guide is a compromise of 
several considerations.
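
To make the point about one deep library concrete, here is a minimal sketch in base R. The counts and library sizes are made up for illustration, and CPM is computed by hand as a stand-in for edgeR's cpm(), so treat it as a toy under those assumptions rather than as the procedure from the guide.

```r
# Toy count matrix: 3 genes x 6 libraries (made-up numbers for illustration).
counts <- matrix(
  c(  0,   0,   0,   0,   0, 120,   # geneA: reads in one library only
     20,  25,  18,  22,  19,  21,   # geneB: modest but consistent counts
      1,   0,   2,   0,   1,   0),  # geneC: essentially unexpressed
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("lib", 1:6))
)

# Hypothetical library sizes (total reads); lib6 is sequenced much deeper.
lib.size <- c(10e6, 10e6, 10e6, 10e6, 10e6, 80e6)

# Total count filter: keep genes whose summed count exceeds a cutoff.
keep.total <- rowSums(counts) > 50

# CPM filter: at least 1 count per million in at least 3 libraries.
# Each column is divided by its own library size.
cpm.manual <- t(t(counts) / lib.size) * 1e6
keep.cpm <- rowSums(cpm.manual > 1) >= 3

print(rbind(total = keep.total, cpm = keep.cpm))
```

In this toy example geneA passes the total count filter purely because of the deep library, but fails the CPM rule because it exceeds 1 cpm in only one sample. geneB passes both filters and geneC fails both.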

BTW, there are newer versions of R and edgeR available than what you are 
using.

Best wishes
Gordon


> Date: Wed, 30 Apr 2014 21:34:50 +0200
> From: Mark Robinson <mark.robinson at imls.uzh.ch>
> To: "Ryan C. Thompson" <rct at thompsonclan.org>
> Cc: bioconductor at r-project.org, Mahnaz Kiani <mahnazkiani at gmail.com>
> Subject: Re: [BioC] total count filter cutoff
>
>
> In my lab, we typically follow a "CPM of at least X in at least Y 
> samples" rule, where X=1 (arbitrary but reasonable, can be changed) and 
> Y=size of smallest replicate group, according to one of the case studies 
> in the user's guide, for example:
>
> ------
> 4.3.6 Filtering
>
> We filter out very lowly expressed tags, keeping genes that are 
> expressed at a reasonable level in at least one treatment condition. 
> Since the smallest group size is three, we keep genes that achieve at 
> least one count per million (cpm) in at least three samples:
>
>> keep <- rowSums(cpm(y)>1) >= 3
>> y <- y[keep,]
> ------
>
> (http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf)
>
> Cheers, Mark
>
>
> ----------
> Prof. Dr. Mark Robinson
> Statistical Bioinformatics, Institute of Molecular Life Sciences
> University of Zurich
> http://ow.ly/riRea
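
The "CPM of at least X in at least Y samples" rule quoted above can be sketched with Y taken from the group factor instead of being typed in by hand. The helper name filter_by_cpm, its arguments and the toy data are all hypothetical, and library size is assumed to be the column total; with edgeR itself, the quoted rowSums(cpm(y)>1) >= 3 line plays the same role.

```r
# Hypothetical helper (not part of edgeR): keep genes with CPM above
# min.cpm in at least as many samples as the smallest replicate group.
filter_by_cpm <- function(counts, group, min.cpm = 1) {
  group <- factor(group)
  min.samples <- min(table(group))           # Y = size of smallest group
  lib.size <- colSums(counts)                # assumption: library size = column total
  cpm.manual <- t(t(counts) / lib.size) * 1e6
  rowSums(cpm.manual > min.cpm) >= min.samples
}

# Toy data: 2 control and 3 treated samples, so Y = 2.
small <- rbind(
  geneA = c(5, 4, 0, 0, 0),   # expressed in both controls only
  geneB = c(3, 0, 0, 0, 0),   # seen in a single sample
  geneC = c(0, 0, 2, 3, 2)    # expressed in all treated samples
)
# A "bulk" row pads every library to one million reads in total.
counts <- rbind(small, bulk = 1e6 - colSums(small))
colnames(counts) <- paste0("s", 1:5)
group <- c("ctrl", "ctrl", "trt", "trt", "trt")

keep <- filter_by_cpm(counts, group)
print(keep)
filtered <- counts[keep, , drop = FALSE]
```

Tying Y to the smallest group means a gene expressed in only one condition can still be kept, as geneA is here, while a gene seen in a single sample (geneB) is dropped.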


> Date: Wed, 30 Apr 2014 11:29:28 -0700 (PDT)
> From: "mahnaz Kiani [guest]" <guest at bioconductor.org>
> To: bioconductor at r-project.org, mahnazkiani at gmail.com
> Subject: [BioC] total count filter cutoff
>
>
> I'm using edgeR to analyse my data and I'm not sure what total count 
> filter cutoff value I should use. My reads are paired 50 bp reads and 
> the total number of reads per sample is about 80,000,000. I tried cutoff 
> values of 5, 10, 15, 30, 50 and 100 and only saw differences between 50 
> and 100, but I am still looking for a logical reason to choose the cutoff 
> value.
>
> Appreciate your help,
> Mahnaz
>
> -- output of sessionInfo():
>
> R 3.0.2



