[BioC] A metric to determine best filtration in the limma package
Gordon K Smyth
smyth at wehi.EDU.AU
Tue Sep 11 01:59:30 CEST 2012
Dear Mark,
I think that voom() should be pretty tolerant of the amount of filtering
that is done, so you can feel free to be more inclusive.
Note that our recommended filtering is
keep <- rowSums(cpm(dge) > k) >= X
where X is the sample size of the smallest group. Since X is usually
smaller than half the number of samples, our recommended filtering is
usually more inclusive than the filter you give.
You are also free to vary k, depending on your sequencing depth. The idea
is to filter out genes with consistently low counts.
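As a rough illustration of how the two filters in this thread compare, here
is a minimal sketch in R on simulated counts. The toy data, the group sizes
(3, 4 and 5) and the names cpm_mat, keep_group and keep_half are assumptions
made up for this example, not part of the original analysis. To keep the
sketch dependency-free, CPM is computed by hand; with edgeR loaded, cpm(dge)
on a DGEList with default normalization factors gives the same values.

```r
set.seed(1)

# Hypothetical toy data: 1000 genes, 12 samples in groups of size 3, 4 and 5
group <- factor(rep(c("A", "B", "C"), times = c(3, 4, 5)))
n_genes <- 1000
gene_mean <- 2^runif(n_genes, min = -3, max = 8)   # wide range of expression
counts <- matrix(rnbinom(n_genes * length(group),
                         mu = rep(gene_mean, times = length(group)),
                         size = 2),
                 nrow = n_genes)

# Counts per million, computed by hand from the column totals
cpm_mat <- t(t(counts) / colSums(counts)) * 1e6

k <- 2                    # CPM cutoff; vary with sequencing depth
X <- min(table(group))    # size of the smallest group (3 in this toy design)

# Group-based filter: CPM > k in at least X samples
keep_group <- rowSums(cpm_mat > k) >= X

# "Half the samples" filter from the original message
keep_half <- rowSums(cpm_mat > k) >= round(ncol(counts) / 2)

# Number of genes retained by each filter
c(group_based = sum(keep_group), half_samples = sum(keep_half))
```

Whenever X is no larger than half the number of samples, every gene kept by
the "half the samples" filter is also kept by the group-based filter, so the
group-based filter can only be more inclusive.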
Best wishes
Gordon
-------------- original message -------------
[BioC] A metric to determine best filtration in the limma package
Aaron Mackey amackey at virginia.edu
Mon Sep 10 16:27:21 CEST 2012
Hello Bioconductor Gurus!
(I apologize if this goes through more than once)
We are currently using limma (through the voom() function) to analyze
RNA-seq data, represented as RSEM counts. We currently have 246 samples
(including replicates) and our design matrix has 65 columns.
My question is in regard to how much we should be filtering our data
before running it through the analysis pipeline. Our current approach is
to look for a CPM of greater than 2 in at least half of the samples. The
code is:
keep <- rowSums(cpm(dge) > 2) >= round(ncol(dge)/2)
This brings down our transcript count from 73,761 to fewer than 20,000.
While we do see groupings and batch effects we expect to see in the MDS
plots, we are afraid we might be filtering too severely.
So finally my question: What is a good metric for determining how well we
have filtered the data?
Thank you,
Mark Lawson, PhD