[BioC] Confirming a batch effect has been removed
Hoyles, Lesley
lesley.hoyles11 at imperial.ac.uk
Thu Aug 9 17:36:06 CEST 2012
Hi Jim
Thanks so much for the feedback. It's been very helpful.
>The only issue would arise if say score_1 samples are only found in
>batch 1 and score_0 samples are only found in batch 2.
Thankfully, I don't have that problem with my data.
Best wishes
Lesley
________________________________________
From: James W. MacDonald [jmacdon at uw.edu]
Sent: 09 August 2012 16:04
To: Hoyles, Lesley
Cc: bioconductor at r-project.org
Subject: Re: [BioC] Confirming a batch effect has been removed
Hi Lesley,
On 8/9/2012 10:35 AM, Hoyles, Lesley wrote:
> Hi
>
> I am working with single-channel Agilent data. I have two batches of microarrays for biopsies from different patients (batch 1, 25 arrays; batch 2, 34 arrays). There are no replicate samples for any of the patients in either of the batches. The patients are classified according to disease score. Using MDS (and confirmed with PCA) I have found there is a pronounced batch effect.
>
> Initially, I tried removing the batch effect using METHOD 1. Using this method I found I was getting 1000s rather than the expected 10s or 100s of genes being differentially expressed. However, I was able to demonstrate with MDS that I had eliminated the batch effect as I was able to plot yave. Using METHOD 2, I got the number of genes I expected to see differentially expressed (based on what I'd seen on analyses with the batches treated as separate entities).
>
> My main questions are:
> How can I show I have removed the batch effect from the dataset using METHOD 2, as I have not made any changes to the expression matrix and have incorporated the batch effect into the model?
I'm not sure you need to show anything. It is not controversial to use a
blocking factor to account for batches.
The only issue would arise if say score_1 samples are only found in
batch 1 and score_0 samples are only found in batch 2. If you have that
sort of situation, then the biological and technical differences are
confounded, and you have real problems, as it is not possible to say if
any differences are biological and not technical.
>
> What value of fit2$df.prior is seen as demonstrating more significant differential expression? I have seen on a list answer that 'A larger value will result in more significant differential expression, and a small value will result in less differential expression'.
>
> I'd also be grateful of an explanation as to why I'm seeing such low adjusted P values using METHOD 1.
Well, you shouldn't be using METHOD 1 in the first place. From the help
page for removeBatchEffects():
This function is useful for removing batch effects, associated
with hybridization time or other technical variables, prior to
clustering or unsupervised analysis such as PCA or heatmaps. It is
not intended to use with linear modelling. For linear modelling,
it is better to include the batch factors in the linear model.
Best,
Jim
>
> Thanks in advance for your assistance.
>
> Best wishes
> Lesley
>
>
> METHOD 1
> ########
> condition.f<- factor(targets$score_score, levels = unique(targets$score_score))
> batch.f<- factor(targets$Batch, levels = unique(targets$Batch))
> design<- model.matrix(~0+condition.f)
> y0.batch<- removeBatchEffect(y0, batch.f, design=design)
> colnames(design)<- levels(condition.f)
> yave<- avereps(y0.batch, ID=y0$genes[,"GeneName"])
> fit<- lmFit(yave, design)
> contrast.matrix<- makeContrasts(score_1-score_0, score_2-score_0, score_3-score_0, score_2-score_1, score_3-score_1, score_3-score_2, score_5-score_0, score_5-score_1, score_5-score_2, score_5-score_3, levels=design)
> contrast.matrix<- makeContrasts(score_1-score_0, score_2-score_0, score_3-score_0, score_2-score_1, score_3-score_1, score_3-score_2, levels=design)
> fit2<- contrasts.fit(fit, contrast.matrix)
> fit2<- eBayes(fit2)
> fit2$df.prior
> fit2$s2.prior
>
>
> Summary results from normalized data with controls included
> score_1 - score_0 score_2 - score_0 score_3 - score_0 score_2 - score_1
> -1 2556 1736 726 0
> 0 26331 27969 29127 30574
> 1 1687 869 721 0
> score_3 - score_1 score_3 - score_2 score_5 - score_0 score_5 - score_1
> -1 404 0 879 12
> 0 29323 30572 29401 30492
> 1 847 2 294 70
> score_5 - score_2 score_5 - score_3
> -1 29 123
> 0 30463 30365
> 1 82 86
>
>
> METHOD 2
> ########
> condition.f<- factor(targets$score_score, levels = unique(targets$score_score))
> batch.f<- factor(targets$Batch, levels = unique(targets$Batch))
> design<- model.matrix(~0+condition.f+batch.f)
> colnames(design)<- c("score_3", "score_2", "score_1", "score_0", "score_5", "Batch")
> y.ave<- avereps(y, ID=y$genes[,"GeneName"])
> fit<- lmFit(y.ave, design)
> write.table(fit, file = "QC/Normalized_controls_included/Faulty_removed_batch/Faulty_removed_batch_fit.txt", sep = "\t")
> contrast.matrix<- makeContrasts(a = score_1-score_0, b = score_2-score_0, c = score_3-score_0, d = score_2-score_1, e = score_3-score_1, f = score_3-score_2, g = score_5-score_0, h = score_5-score_1, i = score_5-score_2, j = score_5-score_3, levels=design)
> fit2<- contrasts.fit(fit, contrast.matrix)
> fit2<- eBayes(fit2)
> fit2$df.prior
> fit2$s2.prior
>
> Summary results from normalized data with controls included
> a b c d e f g h i j
> -1 0 0 23 0 1 0 5 6 3 6
> 0 30574 30574 30493 30574 30564 30574 30512 30499 30503 30506
> 1 0 0 58 0 9 0 57 69 68 62
>
> ##########################################################
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
> [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
> [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] limma_3.12.1
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list