[BioC] genefilter vs limma - many probes filtered

Gordon K Smyth smyth at wehi.EDU.AU
Sun May 25 02:20:40 CEST 2014


Dear Marcin,

Variance filtering should not be used at any stage of the limma analysis. 
You are right to be worried by it.  The Bioc posts you mention from 2009 
and 2012 were about filtering by expression level, not by variance.

Variance filtering has only been shown to be valid and beneficial when 
using ordinary t-tests.  But greater benefits can be had by using the 
limma empirical Bayes t-test and filtering by expression.

If you think that very small or very large variances are an issue with 
your data, then you could discount them in a statistically valid way by 
using the robust option of the eBayes() function in limma.  Again this 
will give greater benefits than ad hoc filtering by observed variances.
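
A minimal sketch of that workflow in R, assuming a log2 expression matrix and a two-group design. The data here are simulated, and the names expr, group and the cutoff of 4 are illustrative assumptions, not values from this thread:

```r
library(limma)

set.seed(1)

# Simulated stand-in for a log2 expression matrix: 1000 probes x 6 arrays.
n_probes <- 1000
mu     <- runif(n_probes, min = 2, max = 12)    # per-probe mean expression
sigma2 <- 0.09 * 4 / rchisq(n_probes, df = 4)   # per-probe true variance
expr   <- mu + matrix(rnorm(n_probes * 6), nrow = n_probes) * sqrt(sigma2)

group  <- factor(rep(c("ctrl", "case"), each = 3), levels = c("ctrl", "case"))
design <- model.matrix(~ group)

# Filter on average expression, not on variance.
# The cutoff (4) is arbitrary here and would need choosing for real data.
keep      <- rowMeans(expr) > 4
expr_kept <- expr[keep, ]

# Linear model fit followed by empirical Bayes moderation.
# robust = TRUE downweights probes with unusually small or large variances
# when the prior is estimated (some limma versions may need the statmod
# package installed for this option).
fit <- lmFit(expr_kept, design)
fit <- eBayes(fit, robust = TRUE)

top <- topTable(fit, coef = 2, number = Inf)
head(top)
```

Probes are removed only because of low average expression; outlying variances are dealt with inside eBayes() rather than by dropping probes.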

Apart from the fact that variance filtering invalidates the limma 
algorithm (or any empirical Bayes algorithm), it also worries me that 
variance filtering lacks a good biological interpretation, whereas 
filtering by mean expression has the clear interpretation of removing 
genes that are not at worthwhile expression levels.
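
A small simulation can illustrate how variance filtering distorts the empirical Bayes prior. This is only a sketch: the variances are simulated under assumed prior parameters (d0 = 4, s0sq = 1), and squeezeVar() is called directly on them rather than through a full limma fit:

```r
library(limma)

set.seed(42)

n_genes  <- 20000
df_resid <- 4      # residual degrees of freedom per gene (assumed)
d0       <- 4      # assumed prior degrees of freedom
s0sq     <- 1      # assumed prior variance

# True variances follow a scaled inverse chi-square prior;
# observed sample variances scatter around them.
sigma2 <- s0sq * d0 / rchisq(n_genes, df = d0)
s2     <- sigma2 * rchisq(n_genes, df = df_resid) / df_resid

# Prior estimated from all genes.
full <- squeezeVar(s2, df = df_resid)

# Prior estimated after discarding the high-variance half of the genes.
keep_low <- s2 <= median(s2)
filtered <- squeezeVar(s2[keep_low], df = df_resid)

c(full = full$var.prior, filtered = filtered$var.prior)
c(full = full$df.prior,  filtered = filtered$df.prior)
```

After filtering, the estimated prior variance is pulled below the value estimated from all genes, so the moderated variances shrink toward a target that is too small.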

Best wishes
Gordon


> Date: Fri, 23 May 2014 13:22:23 +0200
> From: Marcin Jakub Kamiński <marcinjakubkaminski at gmail.com>
> To: Ryan <rct at thompsonclan.org>
> Cc: genefilter Maintainer <maintainer at bioconductor.org>,
> 	bioconductor at r-project.org
> Subject: Re: [BioC] genefilter vs limma - many probes filtered
>
> Hello Ryan,
> thanks for your clear elucidation on this.
> I'm ashamed to admit it, but after some additional reading I believe that
> my question should (at least partially) never have been asked: the limma
> guide advises filtering out low intensities rather than low variances,
> and more details can be found in this discussion:
> https://stat.ethz.ch/pipermail/bioconductor/2013-June/053071.html, which in
> fact agrees with your response.
> However, I'm still unable to find a straightforward answer to the
> question about filtering by variance after the eBayes() procedure (
> https://stat.ethz.ch/pipermail/bioconductor/2012-March/043895.html,
> https://stat.ethz.ch/pipermail/bioconductor/2009-October/030062.html).
> Also, I'm still worried about such a 'beneficial' change after extensive
> filtering, especially as I didn't find any cases where >50% of genes had
> been filtered.
>
> Best regards,
> Marcin
>
>
>
> On Fri, May 23, 2014 at 5:33 AM, Ryan <rct at thompsonclan.org> wrote:
>
>> Hi Marcin,
>>
>> I believe that performing variance filtering is not compatible with the
>> empirical Bayes methods employed in limma. The point of limma is to compute
>> a moderated estimate of each gene's variance by using the average variance
>> across all genes as a prior estimate. If you filter out genes based on
>> their variance, then you will bias that prior estimate, and this bias will
>> propagate to the posterior estimates. For example, if you filter out
>> high-variance genes, limma will underestimate the prior variance, and
>> overestimate the significance of your differential expression calls, which
>> is not a desirable outcome.
>>
>> It may possibly be defensible to perform variance filtering after the
>> empirical Bayes step, but I'm not sure, and you would have to ask someone
>> more knowledgeable about such matters.
>>
>> -Ryan
>>
>>
>> On Thu May 22 18:41:24 2014, Marcin Kaminski [guest] wrote:
>>
>>> Dear list,
>>> I've followed the tips regarding gene filtering at
>>> http://www.bioconductor.org/packages/release/bioc/vignettes/genefilter/inst/doc/independent_filtering.pdf
>>> when analyzing GEO data (GSE48060). In this case most probes would pass
>>> the tests (for adj.p. < .05) if I filter out roughly 70% of them based on
>>> variance, which will triple the number of positives compared to not
>>> filtering at all.
>>> (related graphic: http://i.imgur.com/RuuvRIo.png)
>>> Should I be concerned about such extensive filtering? Does it affect
>>> further analysis with limma and introduce bias? If it's a problem, what are
>>> the available solutions or diagnostics?
>>>
>>> Thanks for your help!
>>>
>>> Best regards,
>>> Marcin
>>>
>>>
>>>   -- output of sessionInfo():
>>>
>>> R version 3.1.0 (2014-04-10)
>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>
>>> locale:
>>> [1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250  LC_MONETARY=Polish_Poland.1250  LC_NUMERIC=C
>>> [5] LC_TIME=Polish_Poland.1250
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>>  [1] RColorBrewer_1.0-5    hgu133plus2.db_2.14.0 org.Hs.eg.db_2.14.0   RSQLite_0.11.4        DBI_0.2-7             AnnotationDbi_1.26.0
>>>  [7] GenomeInfoDb_1.0.2    genefilter_1.46.1     matrixStats_0.8.14    limma_3.20.3          GEOquery_2.30.0       Biobase_2.24.0
>>> [13] BiocGenerics_0.10.0
>>>
>>> loaded via a namespace (and not attached):
>>>  [1] annotate_1.42.0   IRanges_1.22.6    R.methodsS3_1.6.1 RCurl_1.95-4.1    splines_3.1.0     stats4_3.1.0      survival_2.37-7   tools_3.1.0
>>>  [9] XML_3.98-1.1      xtable_1.7-3
