[BioC] PreFiltering probe in microarray analysis
Arno, Matthew
matthew.arno at kcl.ac.uk
Fri Jun 17 12:52:33 CEST 2011
I agree entirely with all these points. They are all perfectly valid in defining an ideal or minimum standard for publication of microarray analyses in their entirety. Indeed that's what this is all about. I am not trying to convert anyone, or propose lowering standards.
My angle is just that sometimes either the study design or the data itself is not up to these standards (and is not intended to be). This is no bar to extracting some useful biological information from the data, as long as the limitations of this are fully recognised, i.e. you don't try and publish gene lists from a single-replicate array study in Cell, prefiltered to remove genes 'below 50' and with no multiple testing correction.
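To put rough numbers on why the prefiltering matters for the reported FDR, here is a minimal sketch in Python. It is not taken from any of the papers or packages mentioned in this thread: the p-values are invented, and `bh_adjust` and `toy_pvalues` are illustrative names. It applies the standard Benjamini-Hochberg adjustment to the same ten small raw p-values, once among 35,000 tests and once among 5,000, the sizes used as an example further down the thread.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def toy_pvalues(n_total, n_top=10):
    """Invented data: n_top small p-values plus evenly spaced 'null' p-values."""
    top = [k * 1e-6 for k in range(1, n_top + 1)]
    n_null = n_total - n_top
    null = [(k + 0.5) / n_null for k in range(n_null)]
    return top + null


adj_35k = bh_adjust(toy_pvalues(35000))  # no prefiltering
adj_5k = bh_adjust(toy_pvalues(5000))    # same top 10 after dropping 30,000 probes

print(f"top gene, 35k tests: adjusted p = {adj_35k[0]:.4f}")
print(f"top gene,  5k tests: adjusted p = {adj_5k[0]:.4f}")
```

With these made-up inputs the top gene's adjusted p-value is about 0.035 among 35,000 tests and about 0.005 among 5,000, although its raw p-value and its rank are unchanged. The sketch only shows the arithmetic; whether the smaller figure is defensible depends on how the filter was chosen, which is the point argued in the messages quoted below.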
In running a core facility and analysing data from studies where money is tight (and getting tighter) I find myself in the position where there is no chance that the data can meet the standards talked about here, but short of saying "go away, spend another £20,000 and don't waste my time" I usually try to glean *some* kind of hypothesis from it. Results range from interesting biological patterns worthy of further work, perhaps publication, to nothing significant at all.
Matt
----------------------
Matthew Arno, Ph.D.
Genomics Centre Manager
King's College London
The contents of this email are strictly confidential. It may not be transmitted in part or in whole to any other individual or groups of individuals.
This email is intended solely for the use of the individual(s) to whom they are addressed and should not be released to any third party without the consent of the sender.
>-----Original Message-----
>From: Moshe Olshansky [mailto:olshansky at wehi.EDU.AU]
>Sent: 17 June 2011 04:19
>To: wayne xu
>Cc: Arno, Matthew; bioconductor at r-project.org
>Subject: Re: [BioC] PreFiltering probe in microarray analysis
>
>Hi Matt,
>
>Let me note that PCR (or even protein analysis) performed on the SAME
>samples does not solve the FDR problem. It will only confirm that the
>microarrays reported correct expression levels (or fold changes). So now
>we are sure that in 3 samples under condition A the level of some gene is
>indeed higher than in 3 samples under condition B, but we still do not
>know whether this is a true phenomenon distinguishing conditions A and B
>or whether it just happened by chance, since we have thousands (or tens
>of thousands) of genes.
>You will need additional (independent) samples to confirm that this is a
>true phenomenon.
>
>Moshe.
>
>> Dear Matt,
>>
>> I read your email again. Since you have lots of thoughts about this
>> issue, I guess you have probably also thought a lot about the
>> solutions. I hope my continued follow-up is not boring. Please point
>> out if I am wrong in what I say.
>>
>> There is no question (or at least there are fewer questions) about an
>> experimental result, such as an RT-PCR result, confirming a
>> differentially expressed gene.
>>
>> However, when we test many genes in microarray or RNAseq, we do need
>> something like FDR to control how many genes we are going to report.
>> Even though this FDR is not the "absolutely true false discovery rate",
>> it can work as a relative controller. The point is that when different
>> people use the same FDR method, the reported FDRs should be comparable.
>>
>> Usually people will not do gene prefiltering first, and do it only when
>> they find the FDR is too high. If you report a gene list with a very
>> high FDR, the reviewers will reject the paper. Therefore people try to
>> make an amazingly good FDR by gene prefiltering. The same gene list
>> that had a high FDR before the gene prefiltering now has a lower FDR.
>> Then the reviewers are happy with the good FDR.
>>
>> It seems that, in some cases, "with this FDR method, we have to do gene
>> prefiltering in order to get a good FDR". We can see here that there
>> are two problems. One is the FDR method itself, and the other is the
>> gene prefiltering approach.
>>
>> Having thought a lot about these problems, I came up with a solution
>> called EDR, in which I have addressed these problems:
>> http://www.ncbi.nlm.nih.gov/pubmed/20846437
>>
>> Have you read this paper? Do you think it could be one of the
>> standardized solutions? Any comments would be appreciated.
>>
>> Best wishes,
>>
>> Wayne
>>
>> --
>> -----------------------------------------------------------------------
>> Wayne Xu, Ph.D
>> Computational Genomics Specialist
>>
>> Supercomputing Institute for Advanced Computational Research
>> 550 Walter Library
>> 117 Pleasant Street SE
>> University of Minnesota
>> Minneapolis, Minnesota 55455
>> email: wxu at msi.umn.edu help email: help at msi.umn.edu
>> phone: 612-624-1447 help phone: 612-626-0802
>> fax: 612-624-8861
>> -----------------------------------------------------------------------
>>
>>
>>
>> --On 6/13/2011 9:01 AM, Arno, Matthew wrote:
>>> Wayne - I *definitely* mean cheating! It depends on whether the FDR is
>>> reported I suppose. Let's say you do a microarray screen and the 'most
>>> changed' gene that comes up (either by largest fold change or smallest
>>> t-test/ANOVA p-value) is 'interesting' biologically speaking. You go on
>>> to validate the change (on the same samples and further test sets)
>>> using qPCR and/or western blots etc., if you go as far as protein
>>> analysis. Therefore you can analyse the importance of that single gene
>>> in a real biological context. No one could argue that the gene is not
>>> changed in the study and other samples, because of the low-throughput
>>> validation, and it makes a nice biological story for a paper. This is
>>> regardless of the arrays used, the test used, the FDR or even the
>>> actual p-value. You could have picked the gene by sticking a pin in a
>>> list; you just used an array to make that pin stick more likely to
>>> give a real change.
>>>
>>> However, the statistical factors do definitely matter when you are
>>> trying to report an overall analysis with lots of
>>> genes/patterns/pathways/functions etc, with a wide range of
>>> conclusions, perhaps in the absence of being able to perform a
>>> high-throughput validation of every gene (or a proportion of them) in
>>> the final 'significant' list. I can see it from both sides...however,
>>> sometimes it's easy to lose sight of the fact that an array
>>> hybridisation is just a hypothesis generator, not a hypothesis solver.
>>> That said, any attempt to standardise this sort of reporting must have
>>> parity and (importantly) transparency with all these factors to have
>>> any success.
>>>
>>> I don't actually think there is a single valid answer to this issue,
>>> as there are so many interpretations/angles; it's just interesting to
>>> see how the debate changes over time. And essential to keep having it
>>> too!
>>>
>>> Thanks for reading - I have lots of thoughts about this!
>>> Matt
>>> ----------------------
>>> Matthew Arno, Ph.D.
>>> Genomics Centre Manager
>>> King's College London
>>>
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: wxu at msi.umn.edu [mailto:wxu at msi.umn.edu]
>>>> Sent: 13 June 2011 14:14
>>>> To: Arno, Matthew
>>>> Cc: bioconductor at r-project.org
>>>> Subject: Re: [BioC] PreFiltering probe in microarray analysis
>>>>
>>>> Thanks, Matt, for joining this discussion,
>>>>
>>>> It is true from a biologist's point of view. You always get the same
>>>> top 10 genes whether you filter or not. But this shifts to another
>>>> question, the 'amazingly good FDR'. For the same top ten genes, people
>>>> can report different FDRs by filtering or not filtering, or by
>>>> filtering a different number of genes. These FDRs in different reports
>>>> are not comparable at all. Does this FDR make sense? People can try to
>>>> make it amazingly good. Does that sound a little like 'cheating'?
>>>> Sorry, I do not mean real cheating here.
>>>>
>>>> Do you have any thoughts about this?
>>>>
>>>> Best wishes,
>>>>
>>>> Wayne
>>>> --
>>>>
>>>>
>>>>
>>>>> Speaking as a pure 'biologist', I think it's OK to pre-filter genes
>>>>> as long as you know the pitfalls, in terms of the potential bias and
>>>>> effect on FDRs. I am personally aware of people pre-filtering not
>>>>> only to enhance the FDR, but to use the results of a t-test as a
>>>>> starting point for a second sequential t-test because the FDRs from
>>>>> this test are 'amazingly good'.
>>>>>
>>>>> However statistically sacrilegious this is, the top 10 genes are
>>>>> always going to be the same top 10 genes, so if you are just looking
>>>>> for the top 10 genes, this is essentially OK.
>>>>>
>>>>> How does that hang with you guys?
>>>>>
>>>>> Matt
>>>>>
>>>>> ----------------------
>>>>> Matthew Arno, Ph.D.
>>>>> Genomics Centre Manager
>>>>> King's College London
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: bioconductor-bounces at r-project.org
>>>>>> [mailto:bioconductor-bounces at r-project.org] On Behalf Of wxu at msi.umn.edu
>>>>>> Sent: 12 June 2011 16:41
>>>>>> To: Wolfgang Huber
>>>>>> Cc: bioconductor at r-project.org
>>>>>> Subject: Re: [BioC] PreFiltering probe in microarray analysis
>>>>>>
>>>>>> Hi, Dear Wolfgang,
>>>>>>
>>>>>> I think it would be nice to bring up a discussion here about the gene
>>>>>> prefiltering issue. Please tell me if this suggestion is
>>>>>> inappropriate.
>>>>>>
>>>>>> There are two questions about gene filtering to which I could not
>>>>>> find answers:
>>>>>> 1). In traditional multiple testing, where the p-values of many test
>>>>>> groups are corrected, for example in a new drug effect experiment, is
>>>>>> it appropriate to remove some group tests from the whole experiment?
>>>>>> If not, why can we prefilter the genes?
>>>>>> 2). As I stated in the previous email, assume that the raw p-values
>>>>>> and the top lowest-p-value genes are the same before (35k genes) and
>>>>>> after gene filtering (5k genes). Is the gene x selected from 35K or
>>>>>> the one selected from 5K more sound? In other words, the best student
>>>>>> selected from 1000 students versus the best student selected from
>>>>>> 100: which is more sound?
>>>>>>
>>>>>> So this is a question about the whole point of the gene prefiltering
>>>>>> approach.
>>>>>> Best wishes,
>>>>>>
>>>>>> Wayne
>>>>>> --
>>>>>>> Hi Swapna
>>>>>>>
>>>>>>> On Jun/2/11 7:58 PM, Swapna Menon wrote:
>>>>>>>> Hi Stephanie,
>>>>>>>> There is another recent paper that you might consider which also
>>>>>>>> cautions about filtering:
>>>>>>>> Van Iterson, M., Boer, J. M., & Menezes, R. X. (2010). Filtering,
>>>>>>>> FDR and power. BMC Bioinformatics, 11(1), 450.
>>>>>>>> They also recommend their own statistical test to see if one's
>>>>>>>> filter biases FDR.
>>>>>>>> Currently I am trying the variance filter and feature filter from
>>>>>>>> the genefilter package: try ?nsFilter for help on these functions.
>>>>>>>> However, I don't use filtering routinely, since choosing the right
>>>>>>>> filter and parameters and testing the effects of any bias are
>>>>>>>> things I have not worked out, in addition to having read Bourgon
>>>>>>>> et al and Iterson et al and others that discuss this issue.
>>>>>>>> About your limma results, while conventional filtering may be
>>>>>>>> expected to increase the number of significant genes, as the papers
>>>>>>>> suggest, the likelihood of false positives also increases.
>>>>>>> No. Properly applied filtering does not affect the false positive
>>>>>>> rates (FWER or FDR). That's the whole point of it. [1]
>>>>>>>
>>>>>>> If one is willing to put up with a higher rate or probability of
>>>>>>> false discoveries, then don't do filtering - just increase the
>>>>>>> p-value cutoff.
>>>>>>> [1] Bourgon et al., PNAS 2010.
>>>>>>>
>>>>>>>> In your current results,
>>>>>>>> do you have high fold changes above 2 (log2>1)? You may want to
>>>>>>>> explore the biological relevance of those genes with high FC and
>>>>>>>> significant unadjusted p value.
>>>>>>>> Best,
>>>>>>>> Swapna
>>>>>>> Best wishes
>>>>>>> Wolfgang Huber
>>>>>>> EMBL
>>>>>>> http://www.embl.de/research/units/genome_biology/huber
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>
>>>>>
>>
>>
>
>
>--
>Moshe Olshansky
>Division of Bioinformatics
>The Walter & Eliza Hall Institute of Medical Research
>1G Royal Parade, Parkville, Vic 3052
>e-mail: olshansky at wehi.edu.au
>tel: (03) 9345 2631
>
>