[BioC] Interpretation of fold.changes for covariates in a linear model using voom

Fri Apr 18 02:01:31 CEST 2014

Dear Francesco,

I think that you may be mis-judging the nature of a linear model with 
multiple factors and covariates.

Limma computes log-fold-changes by fitting a linear model and extracting 
the estimated coefficients.  The interpretation of each coefficient is 
that it represents the change in the reponse (log-expression in this case) 
for each unit change in the covariate, assuming that all other covariates 
in the model can be held constant.

For example, the coefficient for alpha in your hypothetical model is the 
change in the log-expression of the gene that would occur if alpha was 
changed (from 0 to 1) but all other factors in the model were held 
constant.

This is not the same as fitting a simple model with just alpha and no 
other factors.  The coefficient that you would get from the simple model 
with just alpha will not be the same as the coefficient you would get for 
alpha from the full model.  In the simple model, the other factors are not 
held constant.  It would easily be that the full model could give a 
negative coefficient but the simple model gives a positive coefficient. 
This occurs when there are other factors in the model that are positively 
correlated with alpha but have the opposite effect on expression.

When you make a simple summary of log-expression values for wild-type KRAS 
vs. mutated, you are looking at the marginal effect of KRAS when all other 
factors are allowed to vary as well.  In other words, you are examining 
the simple model with just KRAS as a factor.  There is no reason that this 
should be the same as the logFC for KRAS when all other factors have been 
adjusted for.  The simple effect could easily be driven by factors other 
than KRAS.

You might find it helpful to consult a textbook on multiple regression 
that discusses the subtleties of unbalanced anova or correlated 
covariates.

Best wishes
Gordon

> Date: Mon, 14 Apr 2014 06:22:45 -0700 (PDT)
> From: "Francesco Gatto [guest]" <guest at bioconductor.org>
> To: bioconductor at r-project.org, gatto at chalmers.se
> Subject: [BioC] Interpretation of fold.changes for covariates in a
> 	linear	model using voom
>
> I have implemented a linear model to fit RNAseq read counts for ~1000 
> samples using ~200 covariates plus the intercept. The pipeline follows 
> the example in the user manual, i.e. v=voom(y,design) -> 
> fit=lmFit(v,design) -> fit2=eBayes(fit). Then I am using decideTests() 
> to assess the effect of each covariate on gene expression. However, the 
> output is less complete than topTable(), so I used this function to 
> obtain values on fold-changes and individual p-values. At this stage, 
> values become difficult to interpret. If the LM is:
>
> gene_i,j = intercept_i + alfa_i*Factor1_j + beta_i*Factor2_j + (~200 other factors)
>
> and, to simplify, let's say covariates are binary. Then what do the 
> fold-changes for topTable(fit2,coef="alfa") mean?

> My interpretation was the up/down-regulation due to the presence of 
> covariate alfa, but if I boxplot or summarize the expression values for 
> the two groups (alfa = 1 vs alfa = 0) I get quite contradictory results. 
> For example this is the most DE gene when the sample has a mutation in 
> KRAS (the covariate, 0 if wild-type, 1 if mutated):
>
>> topTable(fit2,coef="metadataKRAS",number=1)
>
> hgnc_symbol     logFC    AveExpr         t      P.Value    adj.P.Val        B
> GOLGA6A -1.314213 -5.9000494 -7.375175 3.800569e-13 7.204358e-09 18.80961
>
> The summary of gene expression values for wild-type KRAS vs. mutated is:
>
>> by(E["GOLGA6A",],design[,"metadataKRAS"],summary)
>
> design[, "metadataKRAS"]: 0
>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> -7.882  -6.865  -6.461  -6.021  -5.770   1.548
> ----------------------------------------------------------------
> design[, "metadataKRAS"]: 1
>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> -7.657  -6.247  -5.487  -5.239  -4.582  -1.311
>
> This is an up-regulation, and of magnitude way lower than the one 
> indicated by topTable. Is this because I am misusing the function 
> topTable? I cannot wrap my head around this. Thanks for your help!
>
> -- output of sessionInfo():
>
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] sv_SE.UTF-8/sv_SE.UTF-8/sv_SE.UTF-8/C/sv_SE.UTF-8/sv_SE.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] edgeR_3.4.2  limma_3.18.9
>
> loaded via a namespace (and not attached):
> [1] tools_3.0.2

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}