[BioC] sequential removeBatchEffect()?

Gordon K Smyth smyth at wehi.EDU.AU
Fri Jul 1 01:39:08 CEST 2011


Hi Jenny,

I don't think I'd recommend the sequential approach, which might give 
non-optimal results if the two batch factors are highly correlated with 
one another.

In fact, I'd be tempted to remove only the pop batch effect, since the 
BeadChip effect is relatively small.

Removing both batch effects requires that you have enough data to estimate 
all your experimental treatments reliably as well as the two batch 
effects.  If you want to do this, here is a function that will remove two 
batch effects at once:

removeBatchEffect <- function(x,batch,batch2=NULL,design=matrix(1,ncol(x),1))
{
  x <- as.matrix(x)
  batch <- as.factor(batch)
  contrasts(batch) <- contr.sum(levels(batch))
  if(is.null(batch2)) {
   X <- model.matrix(~batch)[,-1,drop=FALSE]
  } else {
   batch2 <- as.factor(batch2)
   contrasts(batch2) <- contr.sum(levels(batch2))
   X <- model.matrix(~batch+batch2)[,-1,drop=FALSE]
  }
  X <- qr.resid(qr(design),X)
  qrX <- qr(X)
  t(qr.resid(qrX,t(x)))
}

I'll replace removeBatchEffect() with this in limma on Bioc-devel.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth

On Thu, 30 Jun 2011, Jenny Drnevich wrote:

> HI Gordon,
>
> No, Sample_Group isn't confounded with BeadChip. Our microarray unit is very 
> good about asking customers about the treatment groups and randomizing them 
> on arrays. I use GenomeStudio to output summarized beadtype values (no bg or 
> norm), and GenomeStudio always re-organizes the by Sample_Group. I've learned 
> to double-check that the order in my targets file is the same as the order 
> coming out of GenomeStudio!
>
> Jenny
>
> At 06:08 PM 6/29/2011, Gordon K Smyth wrote:
>> Hi Jenny,
>> 
>> Perhaps I'm mis-understanding your setup, but isn't Sample_Group confounded 
>> with BeadChip?  If the samples are entered in BeadChip order, then the 
>> first six would be BeadChip 1 (and also Cont.8), the second six would be 
>> BeadChip 2 (and also DHT.8), etc.  Or have the samples been reordered by 
>> Sample_Group?
>> 
>> Best wishes
>> Gordon
>> 
>> ---------------------------------------------
>> Professor Gordon K Smyth,
>> Bioinformatics Division,
>> Walter and Eliza Hall Institute of Medical Research,
>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> smyth at wehi.edu.au
>> http://www.wehi.edu.au
>> http://www.statsci.org/smyth
>> 
>> On Wed, 29 Jun 2011, Jenny Drnevich wrote:
>> 
>>> Hi Gordon and everyone,
>>> 
>>> I was wondering what is the best way to remove two different batch effects 
>>> from a set of sample values for plotting heatmaps? I have data from 
>>> Illumina's MouseWG6 arrays, 24 samples on 4 BeadChips. There is the effect 
>>> of BeadChip, which I'm accounting for in the model using 
>>> duplicateCorrelation(), and also population batch effect, which I have as 
>>> a term in the design matrix. I'd like to remove both of these effects from 
>>> the individual sample data before making heatmaps of significant genes. 
>>> Should I use removeBatchEffect() twice, first to remove the BeadChip 
>>> effect (with population in the design) and then remove the population 
>>> effect? Here's an example of my analysis code and what I'm thinking of 
>>> doing to remove the two effect. Please let me know if this is alright, or 
>>> if I should do something different:
>>> 
>>>> design <- model.matrix(~0+ factor(targets$Sample_Group))
>>>> colnames(design) <- levels(factor(targets$Sample_Group))
>>>> design <- cbind(design,pop=as.numeric(targets$pop==1))
>>>> design
>>>   Cont.24 Cont.8 DHT.24 DHT.8 pop
>>> 1        0      1      0     0   1
>>> 2        0      1      0     0   0
>>> 3        0      1      0     0   0
>>> 4        0      1      0     0   1
>>> 5        0      1      0     0   1
>>> 6        0      1      0     0   0
>>> 7        0      0      0     1   1
>>> 8        0      0      0     1   0
>>> 9        0      0      0     1   0
>>> 10       0      0      0     1   1
>>> 11       0      0      0     1   0
>>> 12       0      0      0     1   1
>>> 13       1      0      0     0   1
>>> 14       1      0      0     0   0
>>> 15       1      0      0     0   1
>>> 16       1      0      0     0   0
>>> 17       1      0      0     0   0
>>> 18       1      0      0     0   1
>>> 19       0      0      1     0   1
>>> 20       0      0      1     0   0
>>> 21       0      0      1     0   1
>>> 22       0      0      1     0   0
>>> 23       0      0      1     0   0
>>> 24       0      0      1     0   1
>>> 
>>>> duplicateCorrelation(BSlimma.neqc, design,
>>> block=as.numeric(factor(targets$Sentrix_ID)))$consensus
>>> [1]  0.1528686
>>> 
>>>> fit.neqc <- lmFit(BSlimma.neqc, design,
>>> block=as.numeric(factor(targets$Sentrix_ID)), correlation=0.1528686)
>>> 
>>>> cont.matrix <- makeContrasts(DHT.8_v_Cont.8 = DHT.8 - Cont.8,
>>> DHT.24_v_Cont.24 = DHT.24 - Cont.24,
>>> +                             Cont.24_v_Cont.8 = Cont.24 - Cont.8, 
>>> DHT.24_v_DHT.8 = DHT.24 - DHT.8,
>>> +                             interact = (DHT.24 - Cont.24) - (DHT.8 - 
>>> Cont.8),levels=design)
>>> 
>>>> fit.neqc2 <- eBayes(contrasts.fit(fit.neqc,cont.matrix))
>>> 
>>> # This is what I _think_ I should do to remove both batch effects from the 
>>> sample data for heatmaps:
>>> 
>>>> noChip.values <- removeBatchEffect(BSlimm.neqc$E,
>>> batch=as.numeric(factor(targets$Sentrix_ID)), design=design)
>>> 
>>>> noChip.noPop.values <- removeBatchEffect(noChip.values, 
>>>> batch=targets$pop,
>>> design = design[,1:4] )
>>> 
>>> Then I can use noChip.noPop.values for heatmaps?
>>> 
>>> Thanks,
>>> Jenny
>>> 
>>>> sessionInfo()
>>> R version 2.13.0 (2011-04-13)
>>> Platform: x86_64-pc-mingw32/x64 (64-bit)
>>> 
>>> locale:
>>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United 
>>> States.1252    LC_MONETARY=English_United States.1252
>>> [4] LC_NUMERIC=C                           LC_TIME=English_United 
>>> States.1252
>>> 
>>> attached base packages:
>>> [1] splines   grid      stats     graphics  grDevices datasets  utils 
>>> methods   base
>>> 
>>> other attached packages:
>>> [1] statmod_1.4.9         beadarray_2.2.0       limma_3.8.2 WGCNA_1.00 
>>> Hmisc_3.8-3
>>> [6] survival_2.36-5       qvalue_1.26.0         flashClust_1.01 
>>> dynamicTreeCut_1.21   impute_1.26.0
>>> [11] made4_1.26.0          scatterplot3d_0.3-33  gplots_2.8.0 caTools_1.11 
>>> bitops_1.0-4.1
>>> [16] gdata_2.8.1           gtools_2.6.2          RColorBrewer_1.0-2 
>>> ade4_1.4-17           affyPLM_1.28.5
>>> [21] preprocessCore_1.14.0 gcrma_2.24.1          affycoretools_1.24.0 
>>> KEGG.db_2.5.0 GO.db_2.5.0
>>> [26] RSQLite_0.9-4         DBI_0.2-5             AnnotationDbi_1.14.1 
>>> affy_1.30.0           Biobase_2.12.1
>>> [31] RWinEdt_1.8-2
>>> 
>>> loaded via a namespace (and not attached):
>>> [1] affyio_1.20.0     annaffy_1.24.0    annotate_1.30.0   biomaRt_2.8.1 
>>> Biostrings_2.20.1 Category_2.18.0   cluster_1.13.3
>>> [8] genefilter_1.34.0 GOstats_2.18.0    graph_1.30.0      GSEABase_1.14.0 
>>> IRanges_1.10.4   lattice_0.19-23   RBGL_1.28.0
>>> [15] RCurl_1.6-6.1     tcltk_2.13.0      tools_2.13.0      XML_3.4-0.2 
>>> xtable_1.5-6
>>> 
>>> 
>> 
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for the 
>> addressee.
>> You must not disclose, forward, print or use it without the permission of 
>> the sender.
>> ______________________________________________________________________
>
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list