As Gordon Smyth and others have commented, such FPKM normalization is
not necessary (or useful) if differential expression across treatments is
the goal.  Yes, this particular filter will have some bias against smaller
transcripts, which naturally have fewer total counts, but a major intent of
the filter is not only to remove very lowly abundant transcripts, for which
significant differences will be hard to detect, but also to remove those
transcripts with an unnaturally high number of zero counts (those being the
elements that most strongly violate the negative binomial dispersion
assumptions).  This particular "cpm(dge) > X" recipe comes straight out of
the edgeR user guide.
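For concreteness, here is a minimal sketch of that recipe on simulated
counts, using only base R (the real pipeline would call edgeR's cpm() on a
DGEList; the simulation parameters here are arbitrary):

```r
## Simulate a small counts matrix: 2000 genes x 6 samples, overdispersed
## negative binomial counts (arbitrary parameters, for illustration only).
set.seed(1)
counts <- matrix(rnbinom(2000 * 6, mu = 10, size = 0.5), nrow = 2000)

## Counts per million, computed by hand (edgeR's cpm() does this per
## library, with optional normalization factors).
cpm <- t(t(counts) / colSums(counts)) * 1e6

## Keep genes with CPM > 2 in at least half the samples -- the same shape
## as the "cpm(dge) > X" recipe quoted below.
keep <- rowSums(cpm > 2) >= ncol(counts) / 2
filtered <- counts[keep, , drop = FALSE]
```

The filter deliberately requires the CPM cutoff to be met in a minimum
number of samples, rather than on a row mean, so that a single very large
count cannot rescue a gene that is zero almost everywhere.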

But the question remains: how much filtering is "enough" versus "too
much"?  Anecdotally, I've seen other mailing-list postings about raising
the filtering threshold until edgeR stops crashing (a rare event that seems
to have been fixed in the recent development version, at some sacrifice in
parallel-CPU speed).  Using voom(), I imagine something like tuning the
filter until the mean-variance profile stabilizes, but by what metric
(other than by eye) could you measure that?
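One way such a stabilization check could be sketched in base R -- the
cutoff grid, the lowess trend, and the "mean absolute change between
successive cutoffs" metric are all my own assumptions here, not anything
from the edgeR or limma documentation:

```r
## Simulated counts: 10000 genes x 6 samples (arbitrary parameters chosen
## so library sizes are near 1e6 and the CPM cutoffs below actually differ).
set.seed(2)
counts <- matrix(rnbinom(10000 * 6, mu = 100, size = 0.5), nrow = 10000)
cpm    <- t(t(counts) / colSums(counts)) * 1e6
logcpm <- log2(cpm + 0.5)

## Fit a lowess mean-sd trend (voom's mean-variance relation, roughly)
## after filtering at a given CPM cutoff, evaluated on a fixed grid so
## trends from different cutoffs are comparable.
trend_at <- function(cutoff) {
  keep <- rowSums(cpm > cutoff) >= ncol(counts) / 2
  m <- rowMeans(logcpm[keep, , drop = FALSE])
  s <- apply(logcpm[keep, , drop = FALSE], 1, sd)
  fit <- lowess(m, s)
  approx(fit$x, fit$y, xout = seq(0, 12, length.out = 50), rule = 2)$y
}

cutoffs <- c(0.5, 1, 2, 4)
trends  <- sapply(cutoffs, trend_at)

## Mean absolute change in the fitted trend between successive cutoffs;
## "stabilized" would mean this drops below some chosen tolerance.
delta <- rowMeans(abs(diff(t(trends))))
```

This at least turns "by eye" into a number you could threshold, though
what tolerance counts as "stable" is still a judgment call.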

Thoughts?
-Aaron

On Thu, Sep 6, 2012 at 2:20 PM, Steve Lianoglou
<mailinglist.honeypot@gmail.com> wrote:

> Hi,
>
> On Thu, Sep 6, 2012 at 1:52 PM, Mark Lawson <mlawsonvt09@gmail.com> wrote:
> > Hello Bioconductor Gurus!
> >
> > (I apologize if this goes through more than once)
> >
> > We are currently using limma (through the voom() function) to analyze
> > RNA-seq data, represented as RSEM counts. We currently have 246 samples
> > (including replicates) and our design matrix has 65 columns.
> >
> > My question is in regard to how much we should be filtering our data
> before
> > running it through the analysis pipeline. Our current approach is to look
> > for a CPM of greater than 2 in at least half of the samples. The code is:
> >
> > keep <- rowSums(cpm(dge) > 2) >= round(ncol(dge)/2)
>
> I'm guessing you are using "normal" RNA-seq data (i.e. it's not a
> tag-sequencing protocol or something similar), so just a quick thought
> (apologies in advance if I am misunderstanding your setup):
>
> If you are filtering by counts per million without normalizing for
> approximate length of your transcript (like an R/FPKM-like measure),
> aren't you biasing your filter (and, therefore, data)?
>
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
