[BioC] Selecting p-value cutoffs for differential expression

Ramon Diaz-Uriarte rdiaz at cnio.es
Wed Sep 8 10:14:31 CEST 2004


On Tuesday 07 September 2004 17:20, Matthew Hannah wrote:
> Hi,
>
> Before the main question, a minor point from the limma guide (as I'm
> using it to compute the p-values). In the swirl example there is the
> following sentence after the toptable is produced. Are the stats not
> independent because there are duplicate spots, or is there another
> reason that I should be aware of?
>
> "Beware that the Benjamini and Hochberg method used to control the false
> discovery rate assumes independent statistics which we do not have here
> (see help(p.adjust))."


Actually, the control also holds under "positive regression dependency" (details
in the Benjamini and Yekutieli 2001 paper), which some people argue is
what is common with microarray data, because of the "(...) tendency
of measurement errors of gene expressions to be positively correlated
(...)" (p. 370 in Reiner, Yekutieli and Benjamini, 2003, Bioinformatics,
19(3): 368--375). Anyway, the results in the paper by Reiner et al. show
(convincingly to me, at least) that using the BH procedure we do control the
FDR at the desired level.

Briefly, then, I do not worry much about the non-independence of the
statistics when I use BH with microarray data.
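
For readers who have not looked inside the BH procedure, here is a minimal
sketch of the step-up adjustment in plain Python. It is only an illustration:
the function name bh_adjust and the p-values are made up, and in practice you
would simply call p.adjust in R as the limma guide suggests.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from the largest p-value to the smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, idx in enumerate(order):
        rank = m - pos  # rank in ascending order (1 = smallest p-value)
        # Scale by m / rank, then enforce monotonicity with a running minimum.
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values, for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
adj = bh_adjust(pvals)

# Indices of the tests declared significant at an FDR level of 0.05.
selected = [i for i, q in enumerate(adj) if q <= 0.05]
print(selected)
```

With these numbers only the two smallest p-values survive at the 0.05 level,
even though five of the raw p-values are below 0.05.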


>
> Anyway, this aside. I'm looking to canvass opinion on how to select a
> p-value cutoff for genes that are differentially expressed, hopefully
> also allowing an assessment of false positive and negative rates as well.
> I've been playing around with the following, but none seems
> satisfactory. Anyone have any input/experience on this topic?

This is probably a useless answer, but I think a crucial issue is the
objective of the study. If those p-values are used to select a set of 50
genes for RT-PCR, where you have to spend a pre-allocated budget on exactly 50
genes, well, choose the top 50. But if you will only continue with a follow-up
if the evidence is "strong enough", then you will want to weigh, somehow,
what "strong" means against the cost of not following up on some hidden
gem with a not-low-enough p. And I think we have to ponder those issues in
relation to other sources of error (e.g., is your statistical model, the
one that leads to the unadjusted p-values, reasonable?), or to the
representativeness issue (what are we willing to say when our adjusted p
< 10^-9 comes from an observational study with 3 schizophrenic patients and 4
bipolar patients?).
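
To make the contrast between the two objectives concrete, here is a small
sketch in plain Python. Everything in it is hypothetical: the gene names, the
adjusted p-values, and the helper names top_k and below_cutoff are made up for
illustration, and neither the budget nor the cutoff is a recommendation.

```python
# Hypothetical genes with made-up BH-adjusted p-values.
adjusted = {
    "geneA": 0.0004,
    "geneB": 0.012,
    "geneC": 0.03,
    "geneD": 0.08,
    "geneE": 0.2,
    "geneF": 0.45,
}


def top_k(adjusted_p, k):
    """Fixed budget: take the k genes with the smallest adjusted p-values."""
    ranked = sorted(adjusted_p, key=adjusted_p.get)
    return ranked[:k]


def below_cutoff(adjusted_p, cutoff):
    """Evidence threshold: keep only genes whose adjusted p is <= cutoff."""
    ranked = sorted(adjusted_p, key=adjusted_p.get)
    return [gene for gene in ranked if adjusted_p[gene] <= cutoff]


budget_list = top_k(adjusted, 4)              # budget for exactly 4 genes
evidence_list = below_cutoff(adjusted, 0.05)  # follow up only strong evidence
print(budget_list)
print(evidence_list)
```

The budget rule always returns the number of genes you paid for, whatever
their p-values are; the cutoff rule may return fewer genes, or none at all.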


Best,

R.
>
> 1.Look at p-values for genes that are not called present in any of the
> arrays, I suspect some are slipping through as there is still a peak of
> low p-values.
>
> 2.Look at p-values for genes that have not been previously reported as
> regulated by the treatment - but most previous work is poorly replicated
> and has arbitrary cutoffs such as 2 fold, so big peak of low p-values -
> not as big as for those that have been previously reported though - any
> ideas how to use this difference?
>
> 3.Use a set of control or house-keeping genes to define a lower cut-off
> - unfortunately some do respond to the treatment (also confirmed in
> previous work), so how to select appropriate genes...
>
> 4.As it seems that gcrma values have a bimodal distribution - any ideas
> on how to utilise the lower peak (that presumably represents 'absent'
> genes), to calculate a threshold.
>
> 5.Choose an FDR p-value of 0.01, 0.001 or 0.0001, assuming they are
> approximately giving you corresponding false positive rates?
>
> 6. 'Decide' how many genes you want to be differentially expressed, and
> then select one of the above criteria appropriately, this obviously
> works as you'd like ;-) but is tricky to justify!
>
> Cheers,
> Matt
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor

-- 
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900

http://ligarto.org/rdiaz
PGP KeyID: 0xE89B3462
(http://ligarto.org/rdiaz/0xE89B3462.asc)


