[BioC] Confirming a batch effect has been removed

Thu Aug 9 17:36:06 CEST 2012

Hi Jim

Thanks so much for the feedback. It's been very helpful.

>The only issue would arise if say score_1 samples are only found in
>batch 1 and score_0 samples are only found in batch 2.

Thankfully, I don't have that problem with my data.

Best wishes
Lesley

________________________________________
From: James W. MacDonald [jmacdon at uw.edu]
Sent: 09 August 2012 16:04
To: Hoyles, Lesley
Cc: bioconductor at r-project.org
Subject: Re: [BioC] Confirming a batch effect has been removed

Hi Lesley,

On 8/9/2012 10:35 AM, Hoyles, Lesley wrote:
> Hi
>
> I am working with single-channel Agilent data. I have two batches of microarrays for biopsies from different patients (batch 1, 25 arrays; batch 2, 34 arrays). There are no replicate samples for any of the patients in either of the batches. The patients are classified according to disease score. Using MDS (and confirmed with PCA) I have found there is a pronounced batch effect.
>
> Initially, I tried removing the batch effect using METHOD 1. Using this method I found I was getting 1000s rather than the expected 10s or 100s of genes being differentially expressed. However, I was able to demonstrate with MDS that I had eliminated the batch effect as I was able to plot yave. Using METHOD 2, I got the number of genes I expected to see differentially expressed (based on what I'd seen on analyses with the batches treated as separate entities).
>
> My main questions are:
> How can I show I have removed the batch effect from the dataset using METHOD 2, as I have not made any changes to the expression matrix and have incorporated the batch effect into the model?

I'm not sure you need to show anything. It is not controversial to use a
blocking factor to account for batches.

The only issue would arise if say score_1 samples are only found in
batch 1 and score_0 samples are only found in batch 2. If you have that
sort of situation, then the biological and technical differences are
confounded, and you have real problems, as it is not possible to say if
any differences are biological and not technical.

>
> What value of fit2$df.prior is seen as demonstrating more significant differential expression? I have seen on a list answer that 'A larger value will result in more significant differential expression, and a small value will result in less differential expression'.
>
> I'd also be grateful of an explanation as to why I'm seeing such low adjusted P values using METHOD 1.

Well, you shouldn't be using METHOD 1 in the first place. From the help
page for removeBatchEffects():

This function is useful for removing batch effects, associated
      with hybridization time or other technical variables, prior to
      clustering or unsupervised analysis such as PCA or heatmaps. It is
      not intended to use with linear modelling. For linear modelling,
      it is better to include the batch factors in the linear model.

Best,

Jim

>
> Thanks in advance for your assistance.
>
> Best wishes
> Lesley
>
>
> METHOD 1
> ########
> condition.f<- factor(targets$score_score, levels = unique(targets$score_score))
> batch.f<- factor(targets$Batch, levels = unique(targets$Batch))
> design<- model.matrix(~0+condition.f)
> y0.batch<- removeBatchEffect(y0, batch.f, design=design)
> colnames(design)<- levels(condition.f)
> yave<- avereps(y0.batch, ID=y0$genes[,"GeneName"])
> fit<- lmFit(yave, design)
> contrast.matrix<- makeContrasts(score_1-score_0, score_2-score_0, score_3-score_0, score_2-score_1, score_3-score_1, score_3-score_2, score_5-score_0, score_5-score_1, score_5-score_2, score_5-score_3, levels=design)
> contrast.matrix<- makeContrasts(score_1-score_0, score_2-score_0, score_3-score_0, score_2-score_1, score_3-score_1, score_3-score_2, levels=design)
> fit2<- contrasts.fit(fit, contrast.matrix)
> fit2<- eBayes(fit2)
> fit2$df.prior
> fit2$s2.prior
>
>
> Summary results from normalized data with controls included
>     score_1 - score_0 score_2 - score_0 score_3 - score_0 score_2 - score_1
> -1              2556              1736               726                 0
> 0              26331             27969             29127             30574
> 1               1687               869               721                 0
>     score_3 - score_1 score_3 - score_2 score_5 - score_0 score_5 - score_1
> -1               404                 0               879                12
> 0              29323             30572             29401             30492
> 1                847                 2               294                70
>     score_5 - score_2 score_5 - score_3
> -1                29               123
> 0              30463             30365
> 1                 82                86
>
>
> METHOD 2
> ########
> condition.f<- factor(targets$score_score, levels = unique(targets$score_score))
> batch.f<- factor(targets$Batch, levels = unique(targets$Batch))
> design<- model.matrix(~0+condition.f+batch.f)
> colnames(design)<- c("score_3", "score_2", "score_1", "score_0", "score_5", "Batch")
> y.ave<- avereps(y, ID=y$genes[,"GeneName"])
> fit<- lmFit(y.ave, design)
> write.table(fit, file = "QC/Normalized_controls_included/Faulty_removed_batch/Faulty_removed_batch_fit.txt", sep = "\t")
> contrast.matrix<- makeContrasts(a = score_1-score_0, b = score_2-score_0, c = score_3-score_0, d = score_2-score_1, e = score_3-score_1, f = score_3-score_2, g = score_5-score_0, h = score_5-score_1, i = score_5-score_2, j = score_5-score_3, levels=design)
> fit2<- contrasts.fit(fit, contrast.matrix)
> fit2<- eBayes(fit2)
> fit2$df.prior
> fit2$s2.prior
>
> Summary results from normalized data with controls included
>         a     b     c     d     e     f     g     h     i     j
> -1     0     0    23     0     1     0     5     6     3     6
> 0  30574 30574 30493 30574 30564 30574 30512 30499 30503 30506
> 1      0     0    58     0     9     0    57    69    68    62
>
> ##########################################################
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
>   [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
>   [7] LC_PAPER=C                LC_NAME=C
>   [9] LC_ADDRESS=C              LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] limma_3.12.1
>
>       [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099