[BioC] LIMMA lmscFit: difference between linear models

Mon Sep 26 20:37:07 CEST 2005

Hi Koen,

Both models are "valid", but that doesn't mean that they're equally
good.

Here is the main difference between the two models (please correct me if
I'm wrong):

In the first one, you are trying to fit a single coefficients for each
gene in patient and control. This interprets all variations between
patients and controls as error and gives you a rather high variance. So,
for each gene you have a 2 value (for the patients and for the controls)
which you won't trust much because of the variance.

In the second case, you're fitting each patient and control separately
first. Assuming you have decent arrays, then the variance between the
replicates is pretty low. So now you have 8 values (4 for the patients,
4 for the controls) which you'll trust more, because they have lower
variances. Then those are then used for the contrast.

So the second method is more powerful, because it can model the
technical variation separately from the patient variation. This is the
intuition why the second method can call more genes as being
differentially expressed.

Francois

On Mon, 2005-26-09 at 10:46 +0200, Koen Bossers wrote: 
> Dear all,
> 
> I'm currently studying gene expression in human brain samples using 
> Agilent arrays (4 patients, 4 controls). I am analyzing the data using a 
> single channel approach (lmscFit), which I think valid for my dataset, 
> as all channels from each individual nicely cluster together, apart from 
> all other channels.
> 
> The hybridization setup is as follows:
> 
> --------------
>                                    Cy3       Cy5
> US12302316_251182110152_S01_A01   pat2      ctrl3
> US12302316_251182110153_S01_A01   pat3      ctrl2
> US12302316_251182110154_S01_A01   pat1      ctrl4
> US12302316_251182110155_S01_A01   ctrl4     pat4
> US12302316_251182110156_S01_A01   ctrl1     pat1
> US12302316_251182110157_S01_A01   pat4      ctrl1
> US12302316_251182110158_S01_A01   ctrl2     pat4
> US12302316_251182110159_S01_A01   pat2      ctrl1
> US12302316_251182110160_S01_A01   ctrl3     pat1
> US12302316_251182110176_S01_A01   pat3      ctrl4
> US12302316_251182110177_S01_A01   ctrl2     pat2
> US12302316_251182110178_S01_A01   ctrl3     pat3
> 
> --------------
> 
> The first analysis I tried was the following: I replaced all individual 
> labels with a generic one (thus: pat1 becomes pat, ctrl2 becomes ctrl), 
> and calculated a contrast between pat&ctrl:
> 
> --------------
> 
> targets2 <- targetsA2C(targets)
> u <- unique(targets2$Target)
> f <- factor(targets2$Target, levels=u)
> design <- model.matrix(~0+f)
> colnames(design) <- u
> 
> corfit <- intraspotCorrelation(MA, design)
> fit <- lmscFit(MA, design, correlation=corfit$consensus)
> 
> cont.matrix <- makeContrasts(pat-ctrl,levels=design)
> fit2 <- contrasts.fit(fit, cont.matrix)
> fit2 <- eBayes(fit2)
> 
> --------------
> 
> This analysis does not yield many significantly regulated genes (5 or 
> so), which is likely due to the small number of biological replicates 
> and the large diversity in the human population.
> 
> I also tried another approach, leaving the individual labels intact, and 
> fitting a linear model in the following manner:
> 
> --------------
> 
> cont.matrix <- makeContrasts((pat1+pat2+pat3+pat4-ctrl1-ctrl2-ctrl3-ctrl4)/4
>                 ,levels=design)
> 
> 
> --------------
> 
> Is this linear model valid?
> This analysis yields loads of significantly regulated genes (hundreds)! 
> Neither the MA plot or the M values in fit2 look suspicious, so I do not 
> have a reason to distrust this data.
> 
> I do not really understand why there is such a large discrepancy between 
> the two analysis methods. Is this due to the way replication is handled?
> 
> Could anybody comment on the validity of these two analyses, taking into 
> account individual variation in the human population, and the way 
> replication is handled in LIMMA?
> 
> Thank you very much,
> 
> Koen Bossers
> 
>