[BioC] Your thoughts on Limma Array Weights?
Gordon K Smyth
smyth at wehi.EDU.AU
Sat Jun 28 03:28:32 CEST 2008
Matt and Paul have already given nice summaries, so I will add only very
brief comments.
1. If most genes are not DE, then the design matrix for arrayWeights does
not need to be as complex as for the final linear model. Eg., in the 2 vs
2 case you should usually use design=1 to estimate the array weights,
otherwise each array is compared only to its partner rather than to the
other 3 arrays.
2. I use array weights when there is some reason to expect variable array
quality, otherwise I don't. Eg, RNA samples from human patients almost
always vary in quality, so I almost always use array weights with human in
vivo data. See eg Ellis et al, Clinical Cancer Research, to appear 14
July 2008. If array quality plots suggest a problem, then I use weights.
If RNA is plentiful, eg from cell lines or model organisms, and the arrays
look good, then I don't use array weights.
3. In some cases, it is better to remove arrays altogether. However Matt
and I believe that people are generally too trigger-happy with this.
Arrays have to be very bad indeed before they contain no useful
information, so there is a wide range of situations in which better
results can be had by weighting. In gross cases where an array is clearly
bad or wrong, and should be removed, you'll usually know it.
4. The arrayWeights method is largely protected from double-dipping by
the fact that each gene has negligible influence on the array weights.
Best wishes
Gordon
> Date: Fri, 27 Jun 2008 14:41:27 +1000
> From: "Paul Leo" <p.leo at uq.edu.au>
> Subject: Re: [BioC] Your thoughts on Limma Array Weights?
> To: "Mark Cowley" <m.cowley0 at gmail.com>
> Cc: bioconductor at stat.math.ethz.ch
> Message-ID:
> <DE3D1F203DAF7A4CB259560D2801DF8B02B6866A at UQEXMB2.soe.uq.edu.au>
> Content-Type: text/plain; charset="US-ASCII"
>
> HI Mark,
> The example I showed was a triplicate biological replicate and there I
> decided to do use the weights over removing the strange array. In a
> previous experiment I removed the "strange" array and did not use
> weights. But I did not elaborate; in that latter case there was some
> biology at work. The cells used in that experiment were all grown form
> tissues and then passed through a process of differentiation. So I had
> reason to believe the strange array was different due to biology (state
> of differentiation perhaps) rather than some other "more technical"
> effect (as the array and RNA otherwise appeared fine in other QC
> measurements). So with that additional motivation I decided to drop it,
> as it appeared to be a true biological outlier, and no magical weighting
> can fix that, but you never know for sure if you've made the right
> decision, you can only see if what you get makes sense... use GOstats
> etc.
>
> The paper describing the arrayWeights method
> http://www.biomedcentral.com/content/pdf/1471-2105-7-261.pdf
> shows that in simulations with 3 arrays per class that dropping a poor
> array is worse than just keeping it and the best performance is obtained
> from the weighting strategy. **My impressions** from hearing Gordon
> Smyth talk is that his philosophy is to keep the arrays you have and
> weight them appropriately, to mot remove them. Does he always weight
> them in his own work, I don't know. It's probably in the grey judgment
> call zone, like everything else.
>
> That paper also shows the same results for a simple t-test, arguing
> that there is no strange "double-dipping in linear models" when using
> linear models to get weights and a linear model to get differential
> expression (though of course neither is just a simple linear model)
>
> So, along the same lines unless there is something particularly glaring
> you should keep your 2 arrays per class and perhaps try the weighing,
> and gently remind your investigators (using bolt-cutters and a
> blow-torch) how difficult it is to get a standard deviation from 2
> measurements...
>
> Best of luck
> Paul
>
>
>
> -----Original Message-----
> From: Mark Cowley [mailto:m.cowley0 at gmail.com]
> Sent: Friday, June 27, 2008 11:37 AM
> To: Matt Ritchie
> Cc: Paul Leo; bioconductor
> Subject: Re: [BioC] Your thoughts on Limma Array Weights?
>
> Hi Paul and Matt,
> have either of you compared situations with only small number of
> arrays in a 2 group comparison, eg 2 vs 2, or 3 vs 3 and your either
> throw one array away (due to QC), or just down-weight it?
> I commonly throw the poor array away, but one data set that i'm
> currently working on only has 2 vs 2 (read: "pilot experiment"), and
> when you remove an array, then it's 2 vs 1 which is not much fun.
>
> cheers,
> Mark
>
> On 27/06/2008, at 9:53 AM, Matt Ritchie wrote:
>
>> Dear Paul,
>>
>> I have noticed cases where the results are 'better' (i.e. you get more
>> extreme moderated t-statistics or log-odds) if you remove suspect
>> arrays.
>> In one recent example I recall, the experimenter eventually
>> discovered that
>> the genotype of a sample hybridised to one of their arrays was not
>> what they
>> originally thought. This meant that the linear model they were
>> fitting was
>> not right. Although the weight assigned to this array was small,
>> removing
>> it from the analysis altogether still produced better results. The
>> array
>> weights method cannot correct for these kinds of gross errors.
>>
>> I usually take a try it and see approach in my own analyses, similar
>> to what
>> you have done (i.e. run the analysis with equal weights, with array
>> weights,
>> or after removing any suspect arrays altogether, then look at the
>> results of
>> each to see which gives the most extreme statistics).
>>
>> Best wishes,
>>
>> Matt
>>
>>> I use limma quite a bit but have not really been using arrayWeights
>>> much, until recently.
>>> I like it a lot but have found, in some cases, that it appears
>>> better
>>> just ditch the very poorly performing arrays..and then I proceed
>>> without
>>> weighing .
>>>
>>> What are peoples real world experience with arrayWeights, are you
>>> using
>>> it routinely ?
>>>
>>> For example my typical usage... time series with biological
>>> triplicates
>>>
>>>> design
>>> t0hr t6hr t24hr t24p6hr
>>> 1 1 0 0 0
>>> 2 1 0 0 0
>>> 3 1 0 0 0
>>> 4 0 1 0 0
>>> 5 0 1 0 0
>>> 6 0 1 0 0
>>> 7 0 0 1 0
>>> 8 0 0 1 0
>>> 9 0 0 1 0
>>> 10 0 0 0 1
>>> 11 0 0 0 1
>>> 12 0 0 0 1
>>>
>>> arrayw<-arrayWeights(selDataMatrix,design=design)
>>>> arrayw
>>> 1 2 3 4 5 6 7
>>> 8 9
>>> 1.6473168 1.2716081 1.5170375 1.0310794 1.1010048 1.2787543 0.8198722
>>> 0.7162097 2.3992850
>>> 10 11 12
>>> 0.1744961 1.3821469 0.6379648 ## note array 10: which was a
>>> outlier in
>>> hierarchical clustering (though was still more similar to arrays its
>>> biological replicates than any other arrays (based on genes where
>>> sd/mean> 0.1)..
>>>
>>> fit <- lmFit(selDataMatrix, design,weights=arrayw)
>>> fit <- lmFit(selDataMatrix, design)
>>>
>>> cont.matrix <- makeContrasts(
>>> tchange6hr="t6hr-t0hr" ,
>>> tchange24hr="t24hr-t0hr" ,
>>> tchange24p6hr="t24p6hr-t0hr" ,
>>> diff0to6="t6hr-t0hr" ,
>>> diff6to24="t24hr-t6hr" ,
>>> diff24to24p6="t24p6hr-t24hr" ,
>>> levels=design)
>>>
>>> fit2 <- contrasts.fit(fit, cont.matrix)
>>> fit2 <- eBayes(fit2)
>>>
>>> ** Get
>>>> sum(topTable(fit2,coef=1,adjust="fdr",number=5000)[,"B"]>1)
>>> [1] 2927
>>>> sum(topTable(fit2,coef=2,adjust="fdr",number=6000)[,"B"]>1)
>>> [1] 5263
>>>> sum(topTable(fit2,coef=3,adjust="fdr",number=5000)[,"B"]>1)
>>> [1] 2083
>>>> sum(topTable(fit2,coef=4,adjust="fdr",number=5000)[,"B"]>1)
>>> [1] 2927
>>>> sum(topTable(fit2,coef=5,adjust="fdr",number=5000)[,"B"]>1)
>>> [1] 2931
>>>> sum(topTable(fit2,coef=6,adjust="fdr",number=5000)[,"B"]>1)
>>> [1] 3810
>>>
>>> ####################### AS APPOSED TO THE TYPICAL:
>>>
>>> fit <- lmFit(selDataMatrix, design)
>>> fit2 <- contrasts.fit(fit, cont.matrix)
>>> fit2 <- eBayes(fit2)
>>>
>>> ** Get
>>>> sum(topTable(fit2,coef=1,adjust="fdr",number=5000)[,"B"]>1)
>>> [1] 1725
>>>> sum(topTable(fit2,coef=2,adjust="fdr",number=6000)[,"B"]>1)
>>> [1] 3438
>>>> sum(topTable(fit2,coef=3,adjust="fdr",number=5000)[,"B"]>1)
>>> [1] 1512
>>>> sum(topTable(fit2,coef=4,adjust="fdr",number=5000)[,"B"]>1)
>>> [1] 1725
>>>> sum(topTable(fit2,coef=5,adjust="fdr",number=5000)[,"B"]>1)
>>> [1] 1605
>>>> sum(topTable(fit2,coef=6,adjust="fdr",number=5000)[,"B"]>1)
>>> [1] 2610
>>>
>>> Is more differential expression better .. always... I guess so
>>> unless
>>> there are more false positives right? I am slightly worried that in
>>> using a linear model to access array quality and produce weights ,
>>> that
>>> this will then naturally bias a method such as limma that then
>>> using a
>>> linear model, again, to determine differential expression. After
>>> trying
>>> a few different permutations (use weights, remove "worst" arrays and
>>> redo without weights) that this is not a big concern but would
>>> welcome
>>> some feedback from others and insight into how they are using this
>>> function .
>>>
>>> Thanks
>>> Paul
More information about the Bioconductor
mailing list