[BioC] Harsh results using limma!

michael watson (IAH-C) michael.watson at bbsrc.ac.uk
Mon Aug 16 16:06:09 CEST 2004

Hi Tony

I take your point, but I am no longer talking about my pitifully
replicated three-animal experiment ;-)  

I am extending the argument to the larger case where I might have plenty
of replicated experiments and still want to find significant genes
amongst sub-populations.  I've had one e-mail which shows that this is
possible, so I am happy.

What everyone needs to understand, though, is that on one hand we have
statisticians saying we should have 100s of replicates and on the other
hand a legal obligation and an ethics committee saying we have to use as
few animals as possible.  There's also an element of "don't shoot the
messenger"; I didn't design the experiment, in fact the first I heard of
it was when I was asked to analyse it (hands up on the list who's
experienced that....).  I *know* there should be more replicates, but at
the end of the day, I need to make as much use of the data I have as is
possible, and any number of "Argh"'s and "you need more replicates"'s
are not going to help me :-D



-----Original Message-----
From: A.J. Rossini [mailto:rossini at blindglobe.net] 
Sent: 16 August 2004 14:57
To: michael watson (IAH-C)
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Harsh results using limma!

Argh.  You can't really draw conclusions (even discovery) with
biological variation from three animals without bringing in extra
information.

Suppose you don't have extra information.  Think about the possibilities
for the next 3 animals.  Under a conservative assumption that half the
population is differentially expressed, the reasonable options are:

1. none of the next 3 show differential results (probability 1/8,
   which is not unreasonable!).  So you've got 2 of 6 animals, i.e. 1/3
   of the sample, responding (possibly dropping lower...).

2. 2 are non-differential (probability 3/8)

3. 2 are differential (probability 3/8)

4. all 3 are differential (probability 1/8), and you are happier (except
   for "wasting" $$...).
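
The probabilities in the four options above follow from a binomial model
in which each of the next 3 animals independently shows differential
expression with probability 1/2.  A minimal Python sketch of that
arithmetic follows.  The function name and the pooling with the original
2-of-3 result are illustrative assumptions, not part of any analysis in
this thread.

```python
from fractions import Fraction
from math import comb


def next_animals_outcomes(n_new=3, p=Fraction(1, 2),
                          seen_responders=2, seen_total=3):
    """Binomial outcomes for n_new further animals.

    Assumes each new animal responds independently with probability p
    (the "half the population" assumption). Returns a list of tuples
    (new_responders, probability, pooled_fraction_responding).
    """
    rows = []
    for k in range(n_new + 1):
        # P(exactly k of the n_new animals are differential)
        prob = comb(n_new, k) * p**k * (1 - p) ** (n_new - k)
        # Fraction of responders once old and new animals are pooled
        pooled = Fraction(seen_responders + k, seen_total + n_new)
        rows.append((k, prob, pooled))
    return rows


if __name__ == "__main__":
    for k, prob, pooled in next_animals_outcomes():
        print(f"{k} of 3 new animals differential: "
              f"P = {prob}, pooled responders = {pooled}")
```

With the defaults, the k = 0 row gives probability 1/8 and a pooled
responder fraction of 1/3, matching option 1.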

So unfortunately, your claim of "highly repeatable" sounds more like
"wishful thinking", if you look at the possibilities.

(Now, perhaps you are bringing more biological insight into the
problem, and it's not following a discovery paradigm, i.e. the insight
is a priori.  Then a Bayesian decision-making procedure might be
reasonable for looking at the strength of evidence; but in this case you
might just make a decision that heavily weights biology data rather than
expression data.)

That is, use the data to generate hypotheses, and "confirm" using
annotation and metadata.  I've always found this approach suspect, but
it tends to occur in practice and be "believable" to some groups.


"michael watson (IAH-C)" <michael.watson at bbsrc.ac.uk> writes:

> Hi Guys
> Well this turned into a very interesting discussion, thank you for 
> your inputs.  All of the explanations lead to a single conclusion, and
> that is that I (we?) need to find significant differences which are
> present in only subsets of the data.
> Let me explain - here I had samples from three animals.  Two animals 
> showed what looks like highly-repeatable differential expression, and 
> the third did not.  If we make the assumption that this is down to 
> biological variation (ie two of my animals showed an immune response, 
> the third did not, simply because they are different animals), then 
> standard statistical tests are missing an effect which is present in 
> two thirds of my population.  If you ask me "are you interested in 
> finding effects which are present in only two thirds of your 
> population?" then the answer is of course I am!
> Over the last 5 years the whole issue of pharmacogenomics became huge,
> the right drug for the right patient etc, and I know I am speculating
> wildly here, but perhaps what my data is showing me is exactly that -
> that two-thirds of my population show a particular immune response but
> the other third does not.  And that's very interesting ;-)
> Now, to the non-statistician, the "bull in a china shop" approach to
> solving this would appear to be to take all possible subsets of my
> data and run limma on them, to find significant changes in subsets
> of my data.  Clearly this becomes problematic for large datasets.
> Presumably there are many more intelligent ways....?
> Thanks again
> Mick
> -----Original Message-----
> From: Gordon K Smyth [mailto:smyth at wehi.edu.au]
> Sent: 14 August 2004 01:07
> To: David K Pritchard
> Cc: Anthony Rossini; bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] Harsh results using limma!
>> I think Mick's experiences point out a fundamental problem with
>> current statistical analysis of microarray data.  If his data was .2,
>> .2, .2, (dye flips) -.2, -.2, -.2 then Limma would note this gene as
>> highly differentially expressed.  In contrast when he sees 6.29, 5.54,
>> 0.2, (dye flips) -5.27, -4.61, -0.2 Limma did not mark it as
>> differentially expressed.
> Actually it is not true that limma will necessarily rank the first 
> gene higher than the second. Obviously t-tests would do so, but limma 
> may well rank the second gene higher depending on the information 
> about variability inferred from the whole data set.  Looking at fold 
> change alone ranks the second gene higher while t-tests would rank the
> first higher.  Limma is somewhere in between depending on the dataset.
> A typical microarray dataset actually would lead to the second gene
> being ranked higher, i.e., would lead to the ranking that you would 
> prefer.
>>      As a biologist I would argue the case for the genes actually
>> being differentially expressed is much higher in the second case.
>> Using modified t-statistic approaches and with the limited number of
>> repeats common with current array experiments, I see array
>> experiments "missing" these very interesting high variance genes all
>> the time.
>>      Current analytical techniques put a high premium on consistency
>> of results and a lower premium on strength of differential expression,
>> which is the parameter that biologists would argue is the most
>> significant.
>>      There are a variety of biological reasons why high variance
>> genes should exist and personally I think these genes are likely to be
>> the biologically interesting ones that we should be looking for on
>> microarrays.
>>      I understand why Limma does what it does and it is a
>> fantastically useful program. However, I would suggest to the
>> statisticians reading this message that it would be very useful to
>> start developing analytical techniques which could better detect high
>> variance genes.
> I agree with the overall point.  Two strategies currently available
> are:
> 1. Use spot quality weights.  In the example given above it
>    appears that two of the arrays or spots have failed to register any
>    worthwhile fold change for a gene which is differentially expressed on
>    the other arrays.  If this can be identified as being due to low
>    quality spots or arrays, then the values may be down-weighted in an
>    analysis and the gene will revert to being highly significant.
> 2. If small fold changes are not of biological interest to you, then you
>    can require a minimum magnitude for the fold change as well as looking
>    for evidence of differential expression.
> Gordon
>> David Pritchard
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch 
> https://stat.ethz.ch/mailman/listinfo/bioconductor
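
Gordon's point in the quoted thread, that an ordinary t-test and a
variance-moderated statistic can rank the two example genes in opposite
order, can be illustrated with a small Python sketch.  It uses the
moderated t form from limma's empirical Bayes model, in which a gene's
sample variance s^2 (on d degrees of freedom) is shrunk toward a prior
variance s0^2 carrying d0 prior degrees of freedom.  In limma those
prior values are estimated from the whole data set; the values d0 = 4
and s0^2 = 0.05 used here are made-up assumptions chosen purely for
illustration, and the dye-flip values are sign-corrected before use.
This is not limma itself and will not reproduce its output.

```python
from math import inf, sqrt
from statistics import mean, variance

# Log-ratios from the thread, with the dye-flip arrays multiplied by -1.
GENE_1 = [0.2, 0.2, 0.2, 0.2, 0.2, 0.2]
GENE_2 = [6.29, 5.54, 0.2, 5.27, 4.61, 0.2]


def ordinary_t(x):
    """One-sample t-statistic for mean log-ratio against zero."""
    s2 = variance(x)
    if s2 == 0:
        # Zero sample variance: the statistic is unbounded.
        return inf
    return mean(x) / sqrt(s2 / len(x))


def moderated_t(x, d0, s0_sq):
    """t-statistic using a variance shrunk toward the prior s0_sq.

    d0 and s0_sq are supplied by hand here (an assumption); limma
    would estimate them from all genes on the arrays.
    """
    n = len(x)
    d = n - 1
    # Posterior variance: weighted average of prior and sample variance.
    s2_post = (d0 * s0_sq + d * variance(x)) / (d0 + d)
    return mean(x) / sqrt(s2_post / n)


if __name__ == "__main__":
    for name, gene in (("gene 1", GENE_1), ("gene 2", GENE_2)):
        print(name,
              "ordinary t =", ordinary_t(gene),
              "moderated t =", round(moderated_t(gene, d0=4, s0_sq=0.05), 2))
```

Under these assumed prior values the ordinary t-test ranks gene 1 first,
while the moderated statistic ranks gene 2 first (roughly 4.4 against
3.3).  A much smaller prior variance, e.g. s0^2 = 0.001, puts gene 1
back on top, which is the "depending on the dataset" caveat in the
quoted message.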

Anthony Rossini			    Research Associate Professor
rossini at u.washington.edu            http://www.analytics.washington.edu/

Biomedical and Health Informatics   University of Washington
Biostatistics, SCHARP/HVTN          Fred Hutchinson Cancer Research
UW (Tu/Th/F): 206-616-7630 FAX=206-543-3461 | Voicemail is unreliable
FHCRC  (M/W): 206-667-7025 FAX=206-667-4812 | use Email

