[BioC] Filtering out tags with low counts in DESeq and EgdeR?
Wolfgang Huber
whuber at embl.de
Wed May 25 19:35:15 CEST 2011
Just to summarise the main points here:
Filtering on overall counts (disregarding group labels) is admissible in
conjunction with DESeq and afaIcs edgeR: the type-I error is not
measurably affected.
There is benefit to it: more genes can be detected as differentially
expressed at the same type-I error (as measured e.g. by FDR).
The size of the benefit depends on the threshold used for the cutoff. I
do not see major objections to trying several cutoffs and "picking the
best". The size of the benefit tends to be smaller than with microarray
data. In my experience, one gets 10-20% more differentially expressed
genes. This is still not bad for something that is "free".
Why? The reason for the smaller size of the benefit is that those genes
that have zero counts everywhere are already implicitly filtered either
by people's data processing or by DESeq/edgeR. The number of genes that
have some non-zero counts but not enough to be significant seem to lie
in the range mentioned above (this will depend on coverage) - and it is
these for which the filtering provides benefit. With microarrays, both
of these classes of genes needed to be filtered explicitly.
Best wishes
Wolfgang
>
> Hi Martin,
>
> this is a very good question, and it needs more investigation. I'll have
> a closer look into this and report back in this place. Two comments
> though already:
>
> - That the dispersion parameter is estimated from the data need by
> itself not be problematic (It is not problematic in the microattay case
> that the t-test variances are estimated from the data.)
>
> - The situation with limma is different: there, the problem is that
> limma's Bayesian model, which assumes that gene-level error variances
> σ^2_1, ..., σ^2_m follow a scaled inverse χ2 distribution, no longer
> fits the data when the data are filtered for genes with low overall
> variance. Due to the way that limma is implemented, the posterior
> degrees-of-freedom estimate of that distribution then becomes infinite,
> gene-level variance estimates will be ignored (leading to an unintended
> analysis based on fold change only), and type I error rate control is
> lost. See Fig. 2B+C in the paper, and the text on p.3.
>
> So, what we need to do with DESeq is
> - simulate data according to the null model
> - see if & how filtering affects the estimated mean-dispersion relationship
> - see if & how it affects the type I error globally and locally (for
> particular ranges of the mean).
>
> Another point is how much the filtering improves power - that will be
> related to how many genes can actually be removed by the filtering step,
> which depends on the data.
>
> Best wishes
> Wolfgang
>
>
>
> Il May/21/11 4:36 PM, Martin Morgan ha scritto:
>> On 05/21/2011 07:07 AM, Wolfgang Huber wrote:
>>> Hi Xiaohui
>>>
>>> to follow up on the filtering question:
>>>
>>> - the filter that Xiaohui applied is invalid, it will distort the
>>> null-distribution of the test statistic and lead to invalid p-values.
>>> This might explain the discrepancy.
>>>
>>> - the filter that Simon suggested is OK and should provide better
>>> results.
>>>
>>> - I'd also be keen to hear about your experience with this.
>>>
>>> A valid filtering criterion does not change the null distribution of the
>>> subsequently applied test statistic (it can, and in fact should, change
>>> the alternative distribution(s)). In practice, this means choosing a
>>> filter criterion that is statistically independent, under the null, from
>>> the test statistic, and in particular, that it does not use the class
>>> labels. Details in the below-cited PNAS paper.
>>
>> Was wondering whether, since the dispersion parameter is estimated from
>> the data, in some strict sense the filtering and testing procedures are
>> not independent under the null anyway? For the same reason that one
>> would not want to use a variance filter before a limma-style analysis,
>> if I understand correctly.
>>
>> Martin
>>
>>>
>>> Best wishes
>>> Wolfgang
>>>
>>>
>>>
>>>
>>>
>>> Il May/21/11 11:02 AM, Simon Anders ha scritto:
>>>> Hi Xiaohui
>>>>
>>>> I agree thatit is worrying to get so different results from your two
>>>> approaches of using DESeq. Here are a few suggestion how you might
>>>> investigate this (and I'd be eager to hear about your findings):
>>>>
>>>> - Bourgen et al. (PNAS, 2010, 107:9546) have studied how pre-filtering
>>>> affects the validity and power of a test. They stress that it is
>>>> important that the filter is blind to the sample labels (actually: even
>>>> permutation invariant). So what you do here is not statistically sound:
>>>>
>>>> > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ]
>>>>
>>>> Try instead something like:
>>>>
>>>> filter=dat[rowSums(dat) >= 16, ]
>>>>
>>>> - How does your filter affect the variance functions? Do the plots
>>>> generated by 'scvPlot()' differ between the filtered and the unfiltered
>>>> data set?
>>>>
>>>> - If so, are the hits that you get at expression strength were the
>>>> variance functions differ? Are they at the low end, i.e., where the
>>>> filter made changes?
>>>>
>>>> - Have you tried what happens if you filter after estimating variance?
>>>> The raw p values should be the same as without filtering, but the
>>>> adjusted p values might get better.
>>>>
>>>> To be honest, I'm currently a bit at a loss which one is more correct:
>>>> Filtering before or after variance estimation. Let's hear what other
>>>> people on the list think.
>>>>
>>>>> 2. For EdgeR
>>>>
>>>> DESeq and edgeR are sufficiently similar that any correct answer
>>>> regarding filtering should apply to both.
>>>>
>>>>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after
>>>>> adjusting p.value, is this possible? Then, can I used the *unadjusted*
>>>>> p.value to get DE genes?
>>>>> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method =
>>>>> "BH")< 0.05)
>>>>
>>>> Of course, this is possible. (Read up on the "multiple hypothesis
>>>> testing problem" if this is unclear to you.) Not also, though, that you
>>>> used an FDR of .1 in your DESeq code but of .05 here.
>>>>
>>>> Simon
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>>
>
>
--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
More information about the Bioconductor
mailing list