[BioC] limma, subsets of design matrix and p-values
Naomi Altman
naomi at stat.psu.edu
Sun Mar 13 19:24:39 CET 2005
Analysis of variance (ANOVA) and related methods such as SAM and eBayes in
Limma produce p-values that depend on the entire data array, not just the
treatments of interest.
There are 3 main reasons for this. In the discussion below "t-value"
refers to the ordinary t-value for the contrast (for ANOVA), a rank version
(for tests like the Wilcoxon), permutation versions (for SAM and other
permutation routines) and (e)Bayesian versions for routines like Limma eBayes.
All the available data are used to estimate the "within variance" which is
the denominator of the test statistic so the "t-value" (or equivalent)
depends on all the data, not just the 2 treatments being compared
The degrees of freedom of the test are determined by the data used to
estimate the "within variance", so even if the "t-value" is the same, the
p-value will not be.
In some cases, multiple comparisons adjustments are built-in to the
procedures, and these become more stringent as the number of treatments
increases.
Any good intro text on ANOVA will elucidate these issues.
--Naomi
At 02:15 PM 3/10/2005, Mike Schaffer wrote:
>I've run limma for a few months and had a question about the p-values
>being calculated from a large set of data vs. just a subset.
>
>
>All of my 2-color array data is relative to an untreated sample and each
>is replicated three times.
>For example:
> Treatment1 vs untreated x 3,
> Treatment2 vs. untreated x 3,
> Treatment3 vs. untreated x 3
> ...etc.
>
>So I have a large MA object of all the data that is normalized within
>arrays and a design matrix created by:
>
>design <- modelMatrix(targets,ref="untreated")
>
>
>My confusion stems from the fact that if I run eBayes on the entire
>dataset (code below), I get different p-values (but same M values) for the
>TreatmentX vs. untreated, than if I fit the subset that only includes data
>for one treatment (e.g. just Treatment 1 vs. untreated).
>
>For example,
>
>fit <- lmFit(MA,design=design)
>eb <- eBayes(fit)
>eb$p.values[1:10,1]
>
>
>gives different p-values than if I were to only initially subset on the
>Treatment1 vs. untreated data, and then run lmFit.
>For example,
>
>design <- modelMatrix(targets[1:3,],ref="untreated")
>fit <- lmFit(MA[,1:3],design=design) # the first three data sets
>include ALLof the Treatment1 vs. untreated data
>eb <- eBayes(fit)
>eb$p.values[1:10]
>
>
>Am I incorrect to assume that the p-values should be the same regardless
>of how much data is included in the MA object, as long as the design
>matrix has no overlap between experiments (e.g. treatment1 vs. untreated
>data is distinct from treatment2 vs. untreated data) -- aside from the
>fact they are all relative to an untreated sample?
>
>Or is the moderated t-statistic based on ALL of the data in the MA object
>regardless of each experiment's relationship to others?
>
>I'd like to read in all the data and keep it in one large RG and MA
>object. If I'm looking to determine the p-values for genes induced
>relative to untreated by just one of the treatments, should I use lmFit on
>the large MA object or should it be subset first (e.g MA[,1:3]) before
>doing the linear fit?
>
>Am I missing something?
>
>Thanks, in advance, for any help.
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at stat.math.ethz.ch
>https://stat.ethz.ch/mailman/listinfo/bioconductor
Naomi S. Altman 814-865-3791 (voice)
Associate Professor
Bioinformatics Consulting Center
Dept. of Statistics 814-863-7114 (fax)
Penn State University 814-865-1348 (Statistics)
University Park, PA 16802-2111
More information about the Bioconductor
mailing list