[R] strange (to me) p-value distribution

Sun Jun 8 00:40:22 CEST 2008

Dear Mark,

try out the example code below. Such a p-value distribution often occurs 
  if you have "batch" effects, i.e. if the between-group variability is 
in fact less than the within-group variability.

In the example below, I do, for each row of x, a t-test between the 
values in the even and odd columns; for rt2, a "batch effect" has been 
added to columns 1:10.

  hope this helps
	Wolfgang

library("genefilter")

nr = 31000
nc = 20

x  = matrix(rnorm(nr*nc), nrow=nr, ncol=nc)

rt1 = rowttests(x, factor(1:nc %% 2))

## add a batch effect
x[, 1:10] = x[, 1:10] + pi/2
rt2 = rowttests(x, factor(1:nc %% 2))

par(mfrow=c(2,1))
hist(rt1$p.value, breaks=100, col="mistyrose")
hist(rt2$p.value, breaks=100, col="mistyrose")

------------------------------------------------------------------
Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber

Mark Kimpel a écrit 07/06/2008 18:39:
> I'm working with a genomic data-set with ~31k end-points and have
> performed an F-test across 5 groups for each end-point. The QA
> measurments on the individual micro-arrays all look good. One of the
> first things I do in my work-flow is take a look at the p-valued
> distribution. it is my understanding that, if the findings are due to
> chance alone, the p-value distribution should be uniform. In this case
> the histogram, even with 1000 break points, starts low on the left and
> climbs almost linearly to the right. In other words, very skewed
> towards high p-values. I understand that this could be happening by
> chance alone, but the same behavior is seen in the two contrasts of
> interest I looked at and I have seen it in a couple of our other
> genomic, high-dimensional experiments as well. I might also add that I
> looked at the actual numbers of genes with p-val < X and indeed, for
> each X < 0.05, there are far fewer sig. genes than one would expect by
> chance.
> 
> I can't figure out what is causing this and, if there is a cause, I'd
> like to be able to tell the experimenter if it indicates a technical
> factor. I've had other experiments where the p-value dist approximates
> normal and of course those that have nice spikes at low p-values
> indicating we have some significant genes.
> 
> I'm addressing this hear rather than to BioC because I suspect there
> is some basis statistical mechanism that could explain this. Is there?
> 
> Mark
>