[R] strange (to me) p-value distribution
Wolfgang Huber
huber at ebi.ac.uk
Sun Jun 8 00:40:22 CEST 2008
Dear Mark,
try out the example code below. Such a p-value distribution often occurs
if you have "batch" effects, i.e. if the between-group variability is
in fact less than the within-group variability.
In the example below, I do, for each row of x, a t-test between the
values in the even and odd columns; for rt2, a "batch effect" has been
added to columns 1:10.
hope this helps
Wolfgang
library("genefilter")
nr = 31000
nc = 20
x = matrix(rnorm(nr*nc), nrow=nr, ncol=nc)
rt1 = rowttests(x, factor(1:nc %% 2))
## add a batch effect
x[, 1:10] = x[, 1:10] + pi/2
rt2 = rowttests(x, factor(1:nc %% 2))
par(mfrow=c(2,1))
hist(rt1$p.value, breaks=100, col="mistyrose")
hist(rt2$p.value, breaks=100, col="mistyrose")
------------------------------------------------------------------
Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber
Mark Kimpel a écrit 07/06/2008 18:39:
> I'm working with a genomic data-set with ~31k end-points and have
> performed an F-test across 5 groups for each end-point. The QA
> measurments on the individual micro-arrays all look good. One of the
> first things I do in my work-flow is take a look at the p-valued
> distribution. it is my understanding that, if the findings are due to
> chance alone, the p-value distribution should be uniform. In this case
> the histogram, even with 1000 break points, starts low on the left and
> climbs almost linearly to the right. In other words, very skewed
> towards high p-values. I understand that this could be happening by
> chance alone, but the same behavior is seen in the two contrasts of
> interest I looked at and I have seen it in a couple of our other
> genomic, high-dimensional experiments as well. I might also add that I
> looked at the actual numbers of genes with p-val < X and indeed, for
> each X < 0.05, there are far fewer sig. genes than one would expect by
> chance.
>
> I can't figure out what is causing this and, if there is a cause, I'd
> like to be able to tell the experimenter if it indicates a technical
> factor. I've had other experiments where the p-value dist approximates
> normal and of course those that have nice spikes at low p-values
> indicating we have some significant genes.
>
> I'm addressing this hear rather than to BioC because I suspect there
> is some basis statistical mechanism that could explain this. Is there?
>
> Mark
>
More information about the R-help
mailing list