[BioC] ttest or fold change

Tue Dec 16 23:45:28 MET 2003

Garrett et al,

The t-test (or ANOVA) does not have a problem with "accidentally too
small" variances, either with one or more than one outcome of interest.
The estimate of the error variance used by t-tests and ANOVA is a least
squares estimate.  It is UNBIASED, and it is the estimate that goes with
the "best" (minimum variance) linear unbiased estimator (BLUE) of the
effects being tested (see Graybill 1976).
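
The unbiasedness claim can be checked with a small simulation.  The
sketch below is only an illustration in Python; the function name and
the parameter values are made up for this example and do not come from
any package.  It averages the usual (n - 1)-denominator sample variance
over many small normal samples and compares the result with the true
variance.

```python
import random
import statistics

def mean_sample_variance(n, true_sd, n_trials, seed=0):
    """Average the (n - 1)-denominator sample variance over many samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        sample = [rng.gauss(0.0, true_sd) for _ in range(n)]
        total += statistics.variance(sample)  # divides by n - 1
    return total / n_trials

# Three replicates per sample, true variance 2.0 ** 2 = 4.0.
avg = mean_sample_variance(n=3, true_sd=2.0, n_trials=20000)
print(avg)  # expected to land close to 4.0
```

Any single estimate from three replicates can still be far from 4.0;
unbiasedness only says that the estimates are right on average.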

Some Bayesian methods can generate smaller estimates of variances by
biasing the estimate toward some overall measure such as the average of
variances for nearby genes.  These are BIASED estimates based on an
assumption that a particular gene should really be like genes that are
"nearby" in some sense, such as having similar expression levels.
You would have to present a lot of data to me to convince me that any
randomly selected gene should have a variance like some other set of
genes, especially when I have an unbiased estimate at hand that is
non-controversial, requires no defense, and uses methods that have
withstood 100 years of review and scrutiny. I'm familiar with shrunken
estimates of effects that can have a smaller "mean squared error", but
these are random effects, not variances, which control the power and
type I error rate.

These approaches, in addition to producing biased estimates, sometimes
require the analyst to impose his or her own particular biases, called
"prior beliefs" or "priors", on how much these estimates should be
biased: the analyst must input how much weight is given to the data
from that gene and how much weight is given to the other set that the
gene is supposed to "be more like".  Again, it would take some pretty
strong arguments to convince me that any particular analyst's prior
beliefs about how much the data for a gene or the data from other genes
should be weighted are justified.  I would be concerned about how much
convincing a readership, reviewer, or study group would need if they
ever decided to "open the black box" and asked me to explain why such
an approach is reasonable/justifiable.

The program Garrett mentioned, Cyber-T, uses such an approach.  To quote
the Cyber-T manual "...This weighting factor IS CONTROLLED BY THE
EXPERIMENTER AND WILL DEPEND ON HOW CONFIDENT THE EXPERIMENTER IS that
the background variance of a closely related set of genes approximates
the variance of the gene under consideration".  Now if one were looking
at just ONE gene, it makes sense that someone might put a lot of
thought into it, look at a lot of similar genes or other data, come to
the conclusion that the gene should be like some other genes, and THEN
use this approach.  But this is not the case when you have 10,000 or
22,000 genes, at least not in the world I'm familiar with.

I use empirical Bayes methods for fitting general linear mixed models,
where the priors are objective, not my own opinion.  Cyber-T does offer
the option of setting low confidence in the prior, which amounts to an
objective prior, but the manual points out that this results in the
standard Student t-test!  Another feature of Cyber-T is that when you
have "enough" data, the weighted approach converges to the standard
t-test as well.
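
Both limiting cases can be illustrated with a generic weighted average
of a gene's own variance and a background variance.  This is a sketch
under stated assumptions, NOT Cyber-T's actual formula (see its manual
for that); the function name shrunken_variance and the parameter
prior_weight are hypothetical names chosen for this example.

```python
def shrunken_variance(s2, n, background_var, prior_weight):
    """Weighted average of a gene's sample variance and a background variance.

    s2 is weighted by its degrees of freedom (n - 1); background_var is
    weighted by prior_weight, the analyst-chosen confidence in the background.
    """
    df = n - 1
    return (prior_weight * background_var + df * s2) / (prior_weight + df)

s2, background = 0.05, 1.0

# Zero confidence in the background: the gene's own variance is returned.
print(shrunken_variance(s2, n=3, background_var=background, prior_weight=0))

# Fixed prior weight, growing n: the estimate moves toward s2.
for n in (3, 10, 100, 1000):
    print(n, shrunken_variance(s2, n, background, prior_weight=10))
```

With a nonzero prior_weight and few replicates, the result is pulled
toward the background value whether or not the gene's true variance is
anywhere near it, which is the bias described above.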

The real problem that researchers face with microarrays is NOT that
their t-test variances are too small, but that they often have too few
samples to detect the differences they need to detect.  The ready
solution is to get enough data.
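
As a rough guide to what "enough" means, the sketch below uses the
standard normal-approximation formula for a two-sided, two-sample
comparison: n per group = 2 * ((z_alpha + z_beta) * sd / delta)^2.  The
function name and the default alpha and power are assumptions made for
this example.  The approximation runs slightly low for small groups
compared with a t-based calculation, so treat the result as a floor.

```python
from math import ceil
from statistics import NormalDist

def samples_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate n per group to detect a mean difference delta (two-sided)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# Difference of one standard deviation, single test at alpha = 0.05.
print(samples_per_group(delta=1.0, sd=1.0))  # 16

# A much stricter per-gene alpha, as one might use with 10,000 genes
# (a Bonferroni-style threshold, chosen here only for illustration).
print(samples_per_group(delta=1.0, sd=1.0, alpha=0.05 / 10000))
```

The second call needs considerably more samples per group than the
first, which is the practical cost of testing thousands of genes at
once.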

-.- -.. .---- .--. ..-.
Stephen P. Baker, MScPH, PhD (ABD)            (508) 856-2625
Sr. Biostatistician- Information Services
Lecturer in Biostatistics                     (775) 254-4885 fax
Graduate School of Biomedical Sciences
University of Massachusetts Medical School, Worcester
55 Lake Avenue North                          stephen.baker at umassmed.edu
Worcester, MA 01655  USA

------------------------------

Message: 6
Date: Tue, 16 Dec 2003 10:24:31 -0500
From: "Garrett Frampton" <gmframpt at bu.edu>
Subject: RE: [BioC] ttest or fold change
To: <bioconductor at stat.math.ethz.ch>
Message-ID: <00b801c3c3e8$b3ed2cc0$e1be299b at GARRETT>
Content-Type: text/plain;	charset="US-ASCII"

Dr. Baker,

You wrote about "the problem" that the t-test denominator may be
accidentally "too small".  You say that this issue has been solved
within the t-test.  I believe this problem has only been partially
solved: it has been solved for a single hypothesis test within the
t-test, but not for microarray data analysis as a whole.

It is possible to gain power by using local estimates of variance based
upon more than one gene.  This sort of approach is extremely useful for
experiments with only a few replicates because it deals with the
situation where the within group variance for a single gene happens to
be very small.  This is the approach implemented in Cyber-T
(http://visitor.ics.uci.edu/genex/cybert/).  By looking at the dataset
as a whole, rather than one gene at a time, it is possible to eliminate
false positives that arise as a result of coincidentally low within
group variance.
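
The effect described here can be seen in a small simulation.  The
sketch below is an illustration only, with made-up settings (5000 genes,
3 replicates per group, no true differences, unit variance); it does not
use Cyber-T or any microarray package.  It computes an ordinary pooled
two-sample t statistic per gene and compares the pooled variance of the
top 1% of genes by |t| with that of all genes.

```python
import random
import statistics

def simulate_null_genes(n_genes=5000, n_per_group=3, seed=1):
    """Return (|t|, pooled variance) pairs for genes with no true difference."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_genes):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        # Equal group sizes, so the pooled variance is the plain average.
        pooled = (statistics.variance(a) + statistics.variance(b)) / 2
        se = (2 * pooled / n_per_group) ** 0.5
        t = (statistics.mean(a) - statistics.mean(b)) / se
        results.append((abs(t), pooled))
    return results

genes = simulate_null_genes()
genes.sort(reverse=True)           # largest |t| first
top = genes[: len(genes) // 100]   # top 1% "most significant" genes

median_all = statistics.median(v for _, v in genes)
median_top = statistics.median(v for _, v in top)
print(median_all, median_top)
```

Every gene here is a true null, so each of the top-ranked genes is a
false positive, and their pooled variances tend to be well below the
typical value.  The simulation only shows that the pattern exists; it
does not say which remedy is appropriate.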

Do you agree?
Other than this minor point, I think you did a wonderful job putting the
statistical concepts that so many struggle with into words.

Garrett Frampton
Research Associate
Boston University School of Medicine - Microarray Resource