[BioC] GCRMA: low intensity exprs estimates / pval distributions

Wed Mar 16 17:11:29 CET 2005

Hi,

Just looking for opinions again: basically, any comments on the bimodal
p-value distributions that can sometimes be observed when limma
(although I guess this would apply to any test) is used to test for
differential expression on gcrma expression estimates?

In my last mail I didn't really label the plots clearly (see links below
or on BioC archives)

First plots - comparisons of expression estimates using GCRMA 1.1.0 and
1.1.3, with fast=T or F -> There are big differences.

Second plots - p-value distributions from limma comparison
(lmFit,eBayes) of 3 vs. 3 arrays (out of a larger 50 array set) for each
GCRMA normalisation -> The standard (fast=T) GCRMA algorithms produce a
peak at high p-values in addition to the standard distribution.

I also noticed the convest function in limma, but after a quick glance
at the linked paper, that too shows the 'standard' p-value distribution
(as produced by gcrma with fast=F). In fact I've not seen any
discussion or received any comment on how these non-standard p-value
distributions should be interpreted! Any takers?

(see previous mail for more details)

Thanks,
Matt

####previous message####
Hi,

I noticed this a while ago but with some of the recent threads, maybe
now is suitable for a general discussion.

This will be easiest if you view the attached files on the bioconductor
archive site.

Basically GCRMA changed its BG parameter estimation from using a low
quantile of strata of affinity levels (1.1.0 or less) to a smoother way
using loess. There is also a fast=FALSE option which does not use the
(default) faster ad-hoc algorithm (MLE vs. EB?).

If you compare v1.1.0 and 1.1.3 (current stable release) (+/- fast=F)
there are significant differences in the expression estimates,
particularly at the low end. This is not really too surprising as the
data is noisy and each measure will have its own specifics. What is more
interesting are changes in expression. I looked at a simple 3 vs. 3
comparison (limma, eBayes) within a larger normalized dataset (~50
arrays) and, as you can see, high p-values are over-represented when the
default (fast=T) version is used. To me this calls into question whether
the statistical test is still valid, and it also raises questions about
estimating true/false positives and negatives.
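
For anyone who wants to put a number on "over-represented", here is a
minimal sketch (Python, purely for illustration, not part of limma or
gcrma). It bins the p-values and compares the count in the top bin with
what a flat distribution would give. The function names and the choice
of 20 bins are my own assumptions.

```python
def pvalue_bin_counts(pvals, n_bins=20):
    """Count p-values in n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in pvals:
        # p == 1.0 would index past the end, so clamp it into the last bin
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


def high_end_excess(pvals, n_bins=20):
    """Ratio of the observed top-bin count to the count expected
    if all p-values were uniform on [0, 1]."""
    counts = pvalue_bin_counts(pvals, n_bins)
    expected = len(pvals) / n_bins
    return counts[-1] / expected
```

If the nulls are uniform and the real changes pile up at low p-values,
this ratio should sit at or a little below 1. A ratio well above 1 is
the peak at high p-values described above.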

I think (from a quick BioC search, but no documentation) that a step-up
FDR is used within p.adjust (used in limma). Could such a distribution
affect the validity of using FDR correction? Or is this the p-value
equivalent of having positive dependency of the test statistics?
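
To make the question concrete, here is a minimal sketch of a step-up
adjustment, assuming the procedure meant is Benjamini-Hochberg (what
R's p.adjust calls "BH" or "fdr"). It is written in Python purely for
illustration and is not the p.adjust or limma code; the name bh_adjust
is made up here.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg step-up adjusted p-values, in input order."""
    m = len(pvals)
    # Indices of the p-values from smallest to largest
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Step up from the largest p-value, scaling each by m / rank and
    # taking a running minimum so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For example, bh_adjust([0.01, 0.04, 0.03, 0.005]) gives approximately
[0.02, 0.04, 0.04, 0.02].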

This all results from the different intensity distributions from GCRMA.
All are bimodal which is likely to result from the genes that are not
present giving the peak at lower intensities. I guess that these absent
genes are responsible for the over-representation of high p-values as
these genes are just BG. However, I prefer to work with the fast=F
version due to its more conventional p-value distribution.

As a thought - I assume a peak area extraction of the lower peak might
be a nice way of detecting the number of 'present' genes.
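
A rough sketch of what I mean, in Python for illustration only. It
assumes the lower peak is roughly symmetric, so its area is about twice
the area to the left of its mode. The function name, the bin width and
the `split` value (a hand-picked intensity in the valley between the two
peaks) are all hypothetical, and a proper mixture fit would probably do
better.

```python
def estimate_absent_count(intensities, bin_width, split):
    """Estimate how many values belong to the lower intensity peak by
    mirroring the left side of that peak around its modal bin."""
    lo = min(intensities)
    counts = {}
    for x in intensities:
        b = int((x - lo) // bin_width)
        counts[b] = counts.get(b, 0) + 1
    # Only bins whose left edge lies below the split can hold the lower mode
    eligible = [b for b in sorted(counts) if lo + b * bin_width < split]
    # Fullest eligible bin; ties resolve to the lowest bin
    mode_bin = max(eligible, key=lambda b: counts[b])
    left_area = sum(c for b, c in counts.items() if b < mode_bin)
    # Mirror the left side and add the modal bin itself
    estimate = 2 * left_area + counts[mode_bin]
    return min(estimate, len(intensities))
```

The number of 'present' genes would then be the total minus this
estimate, under the same symmetry assumption.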

Any comments?

Cheers,
Matt

-------------- next part --------------
A non-text attachment was scrubbed...
Name: GCRMA_comparison.png
Type: image/png
Size: 10346 bytes
Desc: GCRMA_comparison.png
Url : https://stat.ethz.ch/pipermail/bioconductor/attachments/20050314/1ce953b7/GCRMA_comparison.png
-------------- next part --------------
A non-text attachment was scrubbed...
Name: GCRMA_comparison_Limma.pvals.png
Type: image/png
Size: 9072 bytes
Desc: GCRMA_comparison_Limma.pvals.png
Url : https://stat.ethz.ch/pipermail/bioconductor/attachments/20050314/1ce953b7/GCRMA_comparison_Limma.pvals.png