[BioC] edgeR dataset filtering using pnas_expression.txt

Wed Jan 4 15:19:25 CET 2012

Hi Dave

Dave Tang scripsit 01/04/2012 03:04 PM:
> Hi list,
>
> Just a question regarding edgeR and dataset processing/filtering prior
> to calling differential expression.
>
> Case Study 12 (RNA-seq of Hormone-Treated LNCaP Cells) from the edgeR
> manual mentions that:
>
> "We filter out lowly expressed tags and those which are only expressed
> in a small number of samples. We keep only those tags that have at least
> one count per million in at least three samples."
>
> Then in section 6 of the manual it mentions that:
>
> "The edgeR methodology needs to work with the original digital
> expression counts, so these should not be transformed in any way by
> users prior to analysis. edgeR automatically takes into account the
> total size (total read number) of each library in all calculations of
> fold-changes, concentration and statistical significance."
>
> My question is whether filtering counts as "transforming" the data.
> Since this would affect the total size of each library and thus
> affecting all downstream calculations, is it OK to use such filters?

Typically, filtering of the kind suggested in the edgeR manual passage
cited above has negligible impact on the size factor and dispersion
estimates. At the same time, by removing many gene-by-gene tests whose
null hypothesis never has a chance of being rejected anyway, it will
improve your statistical power experiment-wide (fewer hypotheses means
a milder multiple-testing adjustment).

If your data were peculiar enough that the filtering would affect size 
factor or dispersion estimation, then you would have a problem. To 
address that, you would need to look more closely at data QA/QC and your 
overall analytical strategy.
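
As a rough illustration of how one might check this, here is a minimal
sketch in plain Python (not edgeR's own R code). The function names and
the toy count matrix are made up for this example. It applies the rule
"at least one count per million in at least three samples" and then
reports what fraction of each library's reads survives the filter.

```python
# Hypothetical helpers for illustration only; these are not edgeR functions.

def library_sizes(counts):
    """Total read count per sample (column sums of a tags-by-samples table)."""
    return [sum(col) for col in zip(*counts)]


def cpm_filter(counts, min_cpm=1.0, min_samples=3):
    """Return one boolean per tag: True if the tag reaches at least
    `min_cpm` counts per million in at least `min_samples` samples."""
    sizes = library_sizes(counts)
    keep = []
    for row in counts:
        n_expressed = sum(
            1 for c, s in zip(row, sizes) if s > 0 and c * 1e6 / s >= min_cpm
        )
        keep.append(n_expressed >= min_samples)
    return keep


def retained_fraction(counts, keep):
    """Fraction of each library's reads that remains after filtering."""
    before = library_sizes(counts)
    kept_rows = [row for row, k in zip(counts, keep) if k]
    if not kept_rows:
        return [0.0 for _ in before]
    after = library_sizes(kept_rows)
    return [a / b if b > 0 else 0.0 for a, b in zip(after, before)]


# Toy data: 4 tags (rows) x 4 samples (columns); every library sums to 1e6,
# so here the counts per million equal the raw counts.
counts = [
    [500000, 480000, 510000, 495000],
    [499990, 519995, 490000, 505000],
    [10, 5, 0, 0],   # expressed in only two samples
    [0, 0, 0, 0],    # never observed
]

keep = cpm_filter(counts)
fractions = retained_fraction(counts, keep)
print(keep)
print(fractions)
```

If the retained fractions are all very close to 1, the filter has hardly
changed the library sizes. If they are not, that is the "peculiar" case
described above, and the data deserve a closer look.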

Some more on filtering is here:
- http://www.pnas.org/content/107/21/9546.long (Bourgon et al., PNAS 2010)
- Section 5 "Independent filtering" in the vignette of a recent version
of the DESeq package (e.g. version >= 1.7.3)

	Best wishes
	Wolfgang.

> And
> what should one be cautious about when applying such filters e.g. at
> least n tags in n samples, prior to performing the edgeR analysis?
>
> Many thanks,
>

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber