[BioC] DiffBind - overlap between different peak callers
Paolo Kunderfranco
paolo.kunderfranco at gmail.com
Thu Aug 9 09:34:26 CEST 2012
Dear Rory,
Thanks again for Your input, I went through and finally obtained
consensus peakset identified by both peak callers for
all the samples:
rm(list=ls())
library("DiffBind")
setwd("/home/pkunderf/output/DiffBind/ES-CMN-CMA/")
test=dba(sampleSheet="database.csv")
#clustering based on occupancy/peak calling
pdf('clustering based on occupancy_peak calling.pdf')
dba.plotHeatmap(test)
dba.plotPCA(test)
dev.off()
par(mfrow=c(1,3))
pdf('VENN_H3K27me3.pdf')
dba.plotVenn(test, test$masks$ES &
test$masks$H3K27,label1='mES_m',label2='mES_s')
dba.plotVenn(test, test$masks$CMN &
test$masks$H3K27,label1='CMp_m',label2='CMp_s')
dba.plotVenn(test, test$masks$CMA &
test$masks$H3K27,label1='CMa_m',label2='CMa_s')
dev.off()
pdf('VENN_H3K4me3.pdf')
dba.plotVenn(test, test$masks$ES &
test$masks$H3K4,label1='mES_m',label2='mES_s')
dba.plotVenn(test, test$masks$CMN &
test$masks$H3K4,label1='CMp_m',label2='CMp_s')
dba.plotVenn(test, test$masks$CMA &
test$masks$H3K4,label1='CMa_m',label2='CMa_s')
dev.off()
pdf('VENN_H3K9me3.pdf')
dba.plotVenn(test, test$masks$ES &
test$masks$H3K9,label1='mES_m',label2='mES_s')
dba.plotVenn(test, test$masks$CMN &
test$masks$H3K9,label1='CMp_m',label2='CMp_s')
dba.plotVenn(test, test$masks$CMA &
test$masks$H3K9,label1='CMa_m',label2='CMa_s')
dev.off()
pdf('VENN_H3K79me2.pdf')
dba.plotVenn(test, test$masks$ES &
test$masks$H3K79,label1='mES_m',label2='mES_s')
dba.plotVenn(test, test$masks$CMN &
test$masks$H3K79,label1='CMp_m',label2='CMp_s')
dba.plotVenn(test, test$masks$CMA &
test$masks$H3K79,label1='CMa_m',label2='CMa_s')
dev.off()
test = dba.peakset(test, test$masks$ES & test$masks$H3K27,
minOverlap=2, sampID="mES_H3K27me3")
test = dba.peakset(test, test$masks$CMN & test$masks$H3K27,
minOverlap=2, sampID="CMp_H3K27me3")
test = dba.peakset(test, test$masks$CMA & test$masks$H3K27,
minOverlap=2, sampID="CMa_H3K27me3")
pdf('Overlap_H3K27me3.pdf')
dba.plotVenn(test, test$masks$H3K27 &
test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
dev.off()
test = dba.peakset(test, test$masks$ES & test$masks$H3K9,
minOverlap=2, sampID="mES_H3K9me3")
test = dba.peakset(test, test$masks$CMN & test$masks$H3K9,
minOverlap=2, sampID="CMp_H3K9me3")
test = dba.peakset(test, test$masks$CMA & test$masks$H3K9,
minOverlap=2, sampID="CMa_H3K9me3")
pdf('Overlap_H3K9me3.pdf')
dba.plotVenn(test, test$masks$H3K9 &
test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
dev.off()
test = dba.peakset(test, test$masks$ES & test$masks$H3K79,
minOverlap=2, sampID="mES_H3K79me2")
test = dba.peakset(test, test$masks$CMN & test$masks$H3K79,
minOverlap=2, sampID="CMp_H3K79me2")
test = dba.peakset(test, test$masks$CMA & test$masks$H3K79,
minOverlap=2, sampID="CMa_H3K79me2")
pdf('Overlap_H3K79me2.pdf')
dba.plotVenn(test, test$masks$H3K79 &
test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
dev.off()
test = dba.peakset(test, test$masks$ES & test$masks$H3K4,
minOverlap=2, sampID="mES_H3K4me3")
test = dba.peakset(test, test$masks$CMN & test$masks$H3K4,
minOverlap=2, sampID="CMp_H3K4me3")
test = dba.peakset(test, test$masks$CMA & test$masks$H3K4,
minOverlap=2, sampID="CMa_H3K4me3")
pdf('Overlap_H3K4me3.pdf')
dba.plotVenn(test, test$masks$H3K4 &
test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
dev.off()
consdata = dba(test, mask = test$masks$Consensus)
> consdata
12 Samples, 9754 sites in matrix (16143 total):
ID Tissue Factor Condition Replicate Intervals
1 mES_H3K27me3 ES H3K27 mES_H3K27me3 1-2 1333
2 CMp_H3K27me3 CMN H3K27 CMp_H3K27me3 1-2 911
3 CMa_H3K27me3 CMA H3K27 CMa_H3K27me3 1-2 316
4 mES_H3K9me3 ES H3K9 mES_H3K9me3 1-2 89
5 CMp_H3K9me3 CMN H3K9 CMp_H3K9me3 1-2 86
6 CMa_H3K9me3 CMA H3K9 CMa_H3K9me3 1-2 323
7 mES_H3K79me2 ES H3K79 mES_H3K79me2 1-2 8590
8 CMp_H3K79me2 CMN H3K79 CMp_H3K79me2 1-2 4641
9 CMa_H3K79me2 CMA H3K79 CMa_H3K79me2 1-2 1084
10 mES_H3K4me3 ES H3K4 mES_H3K4me3 1-2 10018
11 CMp_H3K4me3 CMN H3K4 CMp_H3K4me3 1-2 8748
12 CMa_H3K4me3 CMA H3K4 CMa_H3K4me3 1-2 2998
consdata = dba.count(consdata, minOverlap=1)
After I call dba.count I am prompted out R:
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::compare
Abort
Where am I wrong?
Cheers,
Paolo
2012/8/8 Rory Stark <Rory.Stark at cancer.org.uk>:
> Hello Paolo-
>
> To examine the overlaps between peak callers, you need to be working with
> the peak (occupancy) data BEFORE you settle on a global peakset used for
> counting and subsequent analysis. In your example below, you have this in
> "test" after calling "dba" but before calling "dba.count".
>
> So after
>
>> test=dba(sampleSheet="database.csv")
>
> you can check clustering based on occupancy/peak calling:
>
>> dba,plotHeatmap(test)
>> dba.plotPCA(test)
>
>
> You can also plot Venn diagrams of each of the peak caller overlaps -- eg
> for the H3K27me3 mark, you could see the three sample overlaps as follows:
>
>> par(mfrow=c(1,3))
>> dba.plotVenn(test, test$masks$ES & test$masks$H3K27)
>> dba.plotVenn(test, test$masks$CMN & test$masks$H3K27)
>
>> dba.plotVenn(test, test$masks$CMA & test$masks$H3K27)
>
>
> Note that specifying minOverlap=2 to dba.count doesn't limit the overlap
> to replicate peak callers. Rather, it creates a single set of peaks that
> are identified at least twice amongst all the peaksets: either by both
> peak callers for a single sample, or by the same peak caller in different
> samples, by different peak callers in different samples. If you want to
> use only peaks called by both peak callers for each sample,you can add a
> separate consensus peakset for eac condition, eg:
>
>> test = dba.peakset(test, test$masks$ES & test$masks$H3K27,
>>minOvrlap=2, ID="mES_H3K27me3")
>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K27,
>>minOverlap=2, ID="CMp_H3K27e3")
>
>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K27,
>>minOverlp=2, ID="CMa_H3K27me3")
>
>
> repeated 9 more times for a total of 12 pairs. You could if you like see
> the peak overlaps of the three H3K27me3 samples at this point:
>
>> dba.plotVenn(test, testmasks$H3K27 & test$masks$Consensus)
>
> Once you have added all 12 consensus peaksets, create a new DBA object
> containing only the consensus peaksets:
>
>> consdata = ba(test, mask = test$masks$Consensus)
>
> Then, if you want to use all the peaks identified by both peak callers for
> all the samples:
>
>> consdata = dba.count(consdata, minOverlap=1)
>
> And continue with the differential analysis as you had done.
>
> Cheers-
> Rory
>
>>
>>Dear All,
>>I am using package DiffBind,
>>Due to the fact that I miss real biological or technical replicates, I
>>am assesing the overlapping rates between different peak callers.
>>Here is how I set up my database to read in:
>>
>>3 Cell lines, 4 Factor, 2 Replicates, for a total of 12 Conditions
>>
>>> rm(list=ls())
>>> setwd("/home/pkunderf/output/DiffBind/ES-CMN-CMA/")
>>> test=dba(sampleSheet="database.csv")
>>> test
>>24 Samples, 19380 sites in matrix (40809 total):
>> ID Tissue Factor Condition Replicate Intervals
>>1 mES_H3K27me3_m ES H3K27 mES_H3K27me3 1 1834
>>2 CMp_H3K27me3_m CMN H3K27 CMp_H3K27me3 1 2450
>>3 CMa_H3K27me3_m CMA H3K27 CMa_H3K27me3 1 990
>>4 mES_H3K27me3_s ES H3K27 mES_H3K27me3 2 2188
>>5 CMp_H3K27me3_s CN H3K27 CMp_H3K27me3 2 3388
>>6 CMa_H3K27me3_s CMA H3K27 CMa_H3K27me3 2 5946
>>7 mES_H3K4me3_m ES H3K4 mES_H3K4me3 1 10243
>>8 CMp_H3K4me3_m CMN H3K4 CMp_H3K4me3 1 12476
>>9 CMa_H3K4me3_m CMA H3K4 CMa_H3K4me3 1 5632
>>10 mES_H3K4m3_s ES H3K4 mES_H3K4me3 2 14917
>>11 CMp_H3K4me3_s CMN H3K4 CMp_H3K4me3 2 10646
>>12 CMa_H3K4me3_s CMA H3K4 CMa_H3K4me3 2 5985
>>13 mES_H3K9me3_m ES H3K9 mES_H3K9me3 1 110
>>14 CMp_H3K9me3_m CMN H3K9 CMp_H3K9me3 1 484
>>15 CMa_H3K9me3_m CMA H3K9 CMa_H3K9me3 1 938
>>16 mES_H3K9me3_s ES H3K9 mES_H3K9me3 2 569
>>17 CMp_H3K9me3_s CMN H3K9 CMp_H3K9me3 2 808
>>18 CMa_H3K9me3_s CMA H3K9 CMa_H3K9me3 2 3747
>>19 mES_H3K79me2_m ES H3K79 mES_H3K79me2 1 21318
>>20 CMp_H3K79me2_m CMN H3K79 CMp_H3K79me2 1 13210
>>21 CMa_H3K79me2_m CMA H3K79 CMa_H3K79me2 1 2988
>>22 mES_H3K79me2_s ES H3K79 mES_H3K79me2 2 22164
>>23 CMp_H3K79me2_s CMN H3K79 CMp_H3K79me2 2 13429
>>24 CMa_H3K79me2_s CMA H3K79 CMa_H3K79me2 2 6063
>>
>>
>>> test=dba.count(test)
>>
>>> test=dba.contrast(test, categories=DBA_CONDITION,minMembers=2)
>>#minMemebers was se to 2 due to
>>the fact I have 2 replicates for each condition
>>#....78 Contrasts:
>>
>>> test = dba.analyze(test)
>>
>>#for each condition generate a report
>>>test.DB1=dba.report(test,contrast=1)
>>
>>
>>I would like now to assess overlapping rates between replicates, and
>>generate a venn diagram.
>>but it seems like replicates are merged together and I cannot treat
>>them as single samples.
>>I noticed this behaviour also when i try to generate both a PCA plot
>>and a heatmap plot, replicates are merged, so
>>what's the utility to display both in the PCA or heatmap plot?
>>would it be simple to display the original replicates to assess how do
>>they cluster?
>>
>>>pdf('PCA_plot_DBA_CONDITION.pdf')
>>>dba.plotPCA(test,attributes=DBA_CONDITION,
>>>vColors=c("grey0","grey23","grey47","antiquewhite4","antiquewhite3","anti
>>>quewhite1","dodgerblue4","dodgerblue","darkslategray1","brown4","brown","
>>>coral"))
>>>dev.off()
>>
>>>pdf('HeatMap.pdf')
>>>corvals=dba.plotHeatmap(test)
>>>dev.off()
>>
>>>pdf('HeatMap_RPKM_FOLD.pdf')
>>>corvals=dba.plotHeatmap(test,score=DBA_SCORE_RPKM_FOLD)
>>>dev.off()
>>
>>> sessionInfo()
>>R version 2.15.1 (2012-06-22)
>>Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>>locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>[11] C_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>>attached base packages:
>>[1] parallel stats graphics grDevices utils datasets methods
>>[8] base
>>
>>other attached packages:
>>[1] DiffBind_1.2.0 Biobase_2.16.0 GenomicRanges_1.8.9
>>[4] IRanges_1.14.4 BiocGenerics_0.2.0
>>
>>loaded via a namespace (and not attached):
>> [1] amap_0.8-7 edgeR_2.6.10 gdata_2.11.0
>>gplots_2.11.0
>> [5] gtools_2.7.0 limma_3.12.1 RColorBrewer_1.0-5
>>stats4_2.15.1
>> [9] tools_2.15.1 zlibbioc_1.2.0
>>
>>
>>
>>Any sugestion is appreciated,
>>cheers,
>>paolo
>>
>>
More information about the Bioconductor
mailing list