[BioC] edgeR dataset filtering using pnas_expression.txt
Wolfgang Huber
whuber at embl.de
Wed Jan 4 15:19:25 CET 2012
Hi Dave
Dave Tang wrote on 01/04/2012 03:04 PM:
> Hi list,
>
> Just a question regarding edgeR and dataset processing/filtering prior
> to calling differential expression.
>
> Case Study 12 (RNA-seq of Hormone-Treated LNCaP Cells) from the edgeR
> manual mentions that:
>
> "We filter out lowly expressed tags and those which are only expressed
> in a small number of samples. We keep only those tags that have at least
> one count per million in at least three samples."
>
> Then in section 6 of the manual it mentions that:
>
> "The edgeR methodology needs to work with the original digital
> expression counts, so these should not be transformed in any way by
> users prior to analysis. edgeR automatically takes into account the
> total size (total read number) of each library in all calculations of
> fold-changes, concentration and statistical significance."
>
> My question is whether filtering counts as "transforming" the data.
> Since this would affect the total size of each library and thus
> affect all downstream calculations, is it OK to use such filters?
Typically, filtering of the kind suggested in the edgeR manual passage
cited above has negligible impact on the size factor and dispersion
estimates. At the same time, it removes many gene-by-gene tests whose
null hypothesis never has a chance of being rejected anyway, so the
multiple testing adjustment becomes less severe and your statistical
power improves experiment-wide.
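
To illustrate the power argument, here is a minimal Python sketch (not
edgeR code). It applies the Benjamini-Hochberg adjustment to a set of
p-values with and without a block of tests that could never be rejected.
The p-values and the function name bh_adjust are invented for
illustration, and the sketch assumes the filter was chosen without
looking at the test results.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Invented p-values: five tags with reasonable counts ...
expressed_p = [0.001, 0.008, 0.02, 0.3, 0.6]
# ... and fifteen very low-count tags whose tests cannot reach significance.
hopeless_p = [1.0] * 15

adj_unfiltered = bh_adjust(expressed_p + hopeless_p)[:len(expressed_p)]
adj_filtered = bh_adjust(expressed_p)

FDR = 0.05
hits_unfiltered = sum(p < FDR for p in adj_unfiltered)
hits_filtered = sum(p < FDR for p in adj_filtered)
print("rejections without filtering:", hits_unfiltered)
print("rejections with filtering:   ", hits_filtered)
```

With these toy numbers, one tag passes a 5% FDR threshold before
filtering and three pass afterwards, although the raw p-values are
unchanged.
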
If your data were peculiar enough that the filtering would affect size
factor or dispersion estimation, then you would have a problem. To
address that, you would need to look more closely at data QA/QC and your
overall analytical strategy.
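
One simple way to check whether filtering disturbs the library sizes is
to compare them before and after. The sketch below, in Python rather
than R, applies the rule quoted from the manual (at least one count per
million in at least three samples) to a made-up count table. The tag
names, counts and helper names are hypothetical.

```python
# Toy count table: rows are tags, columns are four samples (invented numbers).
counts = {
    "tagA": [600000, 550000, 620000, 580000],
    "tagB": [400000, 450000, 380000, 420000],
    "tagC": [2, 0, 1, 0],
    "tagD": [0, 0, 3, 0],
    "tagE": [5, 4, 0, 6],
}
MIN_CPM = 1.0      # threshold in counts per million
MIN_SAMPLES = 3    # number of samples that must reach the threshold

def library_sizes(table):
    """Total count per sample (column sums)."""
    return [sum(col) for col in zip(*table.values())]

def filter_by_cpm(table, min_cpm, min_samples):
    """Keep tags reaching min_cpm in at least min_samples samples."""
    sizes = library_sizes(table)
    kept = {}
    for tag, row in table.items():
        cpm = [1e6 * c / s for c, s in zip(row, sizes)]
        if sum(v >= min_cpm for v in cpm) >= min_samples:
            kept[tag] = row
    return kept

sizes_before = library_sizes(counts)
filtered = filter_by_cpm(counts, MIN_CPM, MIN_SAMPLES)
sizes_after = library_sizes(filtered)

# Fraction of each library removed by the filter.
rel_change = [(b - a) / b for b, a in zip(sizes_before, sizes_after)]
kept_tags = list(filtered)
print("kept tags:", kept_tags)
print("largest relative change in library size:", max(rel_change))
```

A relative change far below one percent, as in this toy table, is what
the "negligible impact" case looks like. A large change would be the cue
to look more closely at QA/QC.
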
Some more on filtering is here:
- http://www.pnas.org/content/107/21/9546.long (Bourgon et al., PNAS 2010)
- Section 5 "Independent filtering" in the vignette of a recent version
of the DESeq package (e.g. version >= 1.7.3)
Best wishes
Wolfgang.
> And
> what should one be cautious about when applying such filters e.g. at
> least n tags in n samples, prior to performing the edgeR analysis?
>
> Many thanks,
>
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber