[BioC] Limma weights, P values and differential expression

Sat Aug 26 02:52:06 CEST 2006

Dear Alfredo,

At 08:00 PM 25/08/2006, bioconductor-request at stat.math.ethz.ch wrote:
>Date: Thu, 24 Aug 2006 07:29:07 -0700 (PDT)
>From: Alfredo Juncal <alfjuncal at yahoo.com>
>Subject: [BioC] Limma weights,  P values and differential expression
>To: bioconductor at stat.math.ethz.ch
>
>Dear Limma users,
>
>   Although I have read the manual and searched the mailing list, I 
> still do not fully understandf how weights are used when applying 
> the linear model.

In limma, weights are always treated as relative weights in model 
fits. This approach is customary in regression analysis and is 
explained in most statistics textbooks. A consequence of this 
approach is that the absolute size of the weights is immaterial: you 
will get the same differential expression result if you set all the 
weights to 100, or all the weights to 0.01, or all the weights to 1.

At the normalization stage, a loess curve is fitted for each array. 
Here the relative sizes of the weights for different probes are used. 
Probes with lower weights for that array get less weight.

At the differential expression stage, a linear model is fitted for 
each probe. Here it is the relative sizes of the weights for 
different arrays for that probe which are relevant. Arrays with lower 
weights get downweighted. But note, if all the weights for a given 
probe are the same size, it doesn't matter what that size is. If a 
particular probe gets weight 0.001 on every array, then the 
differential expression analysis will be exactly the same as if that 
probe got weight 1000 on every array. The weights are only used to 
differentially weight the different arrays. Weights are not used to 
downweight one probe relative to another in the differential 
expression analysis, at least not unless the weights are actually zero.

>  In our case we have four cDNA microarrays (2 biological replicates 
> with 2 dye-swaps for each) in which we compare a mutant with a wt 
> sample. The microarrays have each gene printed once. We use 
> different spot quality filters and we apply a 0.01 weight to the 
> spots which do not meet these criteria or which are controls. We 
> then apply lmFit with weights and use BH to adjust for multiple 
> testing. These downweighted spots do not affect the Loess print-tip 
> normalization, but how are they treated in the linear model?
>
>   For example, some of the genes with 3 replicate spots with weight 
> 0.01 and 1 with weight 1 have an adjusted P value < 0.05. Is Limma 
> "taking into consideration" that 3 of the spots are suboptimal when 
> calculating the Adj. P.Vals.? Could this gene be considered fairly 
> reliable because although 3 of the spots are suboptimal the fact 
> that they have a similar ratio than the one which is optimal "adds 
> strength" to the differential expression of that gene?

Yes, and that seems reasonable to me.

Also note that weights c(1,0.01,0.01,0.01) are will give the same 
result as c(100,1,1,1). It is only the relative size which is important.

>   If we set the 0.01 weights to 0.00 after the normalization, Adj.P 
> Values are not calculated (as expected displays NA) for those genes 
> with all four weights being 0. However, P values are calculated for 
> those genes with only one weight being 1. How come  there is a Pval 
> in the latter if there is only 1 optimal observation?

... because the Bayesian approach supplies a standard deviation 
estimate even when the data for that gene does not.

>The M values are almost identical doing it this way or leaving the 
>weights at 0.01, but the Adj.P.Vals are somewhat different (Example: 
>3000 genes with P<0.05 doing it either way but around 1000 genes are 
>< 0.05 only by doing it one way or the other). However, no 
>differences with genes over log2ratio 1 or -1 and very slight ones 
>(12 genes)  with genes log2ratio < 0.58 or -0.58.
>
>In summary, after obtaining the Limma results list, would you set an 
>extra filter to select only those genes which have N spots meeting 
>the quality criteria? Is it necessary?  Which value should N be? 
>Thank you very much,

I don't see why you would, and from the results you report is seems 
that limma is doing reasonable things with your data already.

It seems to me that if you really do not want a particular gene to be 
appearing in your gene list, even when the data for that gene seem 
perfectly reasonable, then you should remove that gene from the 
analysis or give the spots zero weight, rather than adding an extra 
ad hoc step after the analysis.

Best wishes
Gordon

>A. Juncal