[BioC] DiffBind - overlap between different peak callers

Tue Aug 14 16:51:31 CEST 2012

Hi,
Any idea on how to solve this problem? I am stuck just at the end of
the project,
Cheers,
Paolo

> After I call dba.count I am prompted out R:
>
> terminate called after throwing an instance of 'std::out_of_range'
>   what():  basic_string::compare
> Abort
>

2012/8/9 Paolo Kunderfranco <paolo.kunderfranco at gmail.com>:
> Dear Rory,
> Thanks again for Your input, I went through and finally obtained
> consensus peakset  identified by both peak callers for
> all the samples:
>
>
>
> rm(list=ls())
> library("DiffBind")
> setwd("/home/pkunderf/output/DiffBind/ES-CMN-CMA/")
>
> test=dba(sampleSheet="database.csv")
> #clustering based on occupancy/peak calling
> pdf('clustering based on occupancy_peak calling.pdf')
> dba.plotHeatmap(test)
> dba.plotPCA(test)
> dev.off()
>
> par(mfrow=c(1,3))
>
> pdf('VENN_H3K27me3.pdf')
> dba.plotVenn(test, test$masks$ES  &
> test$masks$H3K27,label1='mES_m',label2='mES_s')
> dba.plotVenn(test, test$masks$CMN &
> test$masks$H3K27,label1='CMp_m',label2='CMp_s')
> dba.plotVenn(test, test$masks$CMA &
> test$masks$H3K27,label1='CMa_m',label2='CMa_s')
> dev.off()
> pdf('VENN_H3K4me3.pdf')
> dba.plotVenn(test, test$masks$ES  &
> test$masks$H3K4,label1='mES_m',label2='mES_s')
> dba.plotVenn(test, test$masks$CMN &
> test$masks$H3K4,label1='CMp_m',label2='CMp_s')
> dba.plotVenn(test, test$masks$CMA &
> test$masks$H3K4,label1='CMa_m',label2='CMa_s')
> dev.off()
> pdf('VENN_H3K9me3.pdf')
> dba.plotVenn(test, test$masks$ES  &
> test$masks$H3K9,label1='mES_m',label2='mES_s')
> dba.plotVenn(test, test$masks$CMN &
> test$masks$H3K9,label1='CMp_m',label2='CMp_s')
> dba.plotVenn(test, test$masks$CMA &
> test$masks$H3K9,label1='CMa_m',label2='CMa_s')
> dev.off()
> pdf('VENN_H3K79me2.pdf')
> dba.plotVenn(test, test$masks$ES  &
> test$masks$H3K79,label1='mES_m',label2='mES_s')
> dba.plotVenn(test, test$masks$CMN &
> test$masks$H3K79,label1='CMp_m',label2='CMp_s')
> dba.plotVenn(test, test$masks$CMA &
> test$masks$H3K79,label1='CMa_m',label2='CMa_s')
> dev.off()
> test = dba.peakset(test, test$masks$ES  & test$masks$H3K27,
> minOverlap=2, sampID="mES_H3K27me3")
> test = dba.peakset(test, test$masks$CMN & test$masks$H3K27,
> minOverlap=2, sampID="CMp_H3K27me3")
> test = dba.peakset(test, test$masks$CMA & test$masks$H3K27,
> minOverlap=2, sampID="CMa_H3K27me3")
> pdf('Overlap_H3K27me3.pdf')
> dba.plotVenn(test, test$masks$H3K27 &
> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
> dev.off()
> test = dba.peakset(test, test$masks$ES  & test$masks$H3K9,
> minOverlap=2, sampID="mES_H3K9me3")
> test = dba.peakset(test, test$masks$CMN & test$masks$H3K9,
> minOverlap=2, sampID="CMp_H3K9me3")
> test = dba.peakset(test, test$masks$CMA & test$masks$H3K9,
> minOverlap=2, sampID="CMa_H3K9me3")
> pdf('Overlap_H3K9me3.pdf')
> dba.plotVenn(test, test$masks$H3K9 &
> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
> dev.off()
> test = dba.peakset(test, test$masks$ES  & test$masks$H3K79,
> minOverlap=2, sampID="mES_H3K79me2")
> test = dba.peakset(test, test$masks$CMN & test$masks$H3K79,
> minOverlap=2, sampID="CMp_H3K79me2")
> test = dba.peakset(test, test$masks$CMA & test$masks$H3K79,
> minOverlap=2, sampID="CMa_H3K79me2")
> pdf('Overlap_H3K79me2.pdf')
> dba.plotVenn(test, test$masks$H3K79 &
> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
> dev.off()
> test = dba.peakset(test, test$masks$ES  & test$masks$H3K4,
> minOverlap=2, sampID="mES_H3K4me3")
> test = dba.peakset(test, test$masks$CMN & test$masks$H3K4,
> minOverlap=2, sampID="CMp_H3K4me3")
> test = dba.peakset(test, test$masks$CMA & test$masks$H3K4,
> minOverlap=2, sampID="CMa_H3K4me3")
> pdf('Overlap_H3K4me3.pdf')
> dba.plotVenn(test, test$masks$H3K4 &
> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
> dev.off()
>
> consdata = dba(test, mask = test$masks$Consensus)
>> consdata
> 12 Samples, 9754 sites in matrix (16143 total):
>              ID Tissue Factor    Condition Replicate Intervals
> 1  mES_H3K27me3     ES  H3K27 mES_H3K27me3       1-2      1333
> 2  CMp_H3K27me3    CMN  H3K27 CMp_H3K27me3       1-2       911
> 3  CMa_H3K27me3    CMA  H3K27 CMa_H3K27me3       1-2       316
> 4   mES_H3K9me3     ES   H3K9  mES_H3K9me3       1-2        89
> 5   CMp_H3K9me3    CMN   H3K9  CMp_H3K9me3       1-2        86
> 6   CMa_H3K9me3    CMA   H3K9  CMa_H3K9me3       1-2       323
> 7  mES_H3K79me2     ES  H3K79 mES_H3K79me2       1-2      8590
> 8  CMp_H3K79me2    CMN  H3K79 CMp_H3K79me2       1-2      4641
> 9  CMa_H3K79me2    CMA  H3K79 CMa_H3K79me2       1-2      1084
> 10  mES_H3K4me3     ES   H3K4  mES_H3K4me3       1-2     10018
> 11  CMp_H3K4me3    CMN   H3K4  CMp_H3K4me3       1-2      8748
> 12  CMa_H3K4me3    CMA   H3K4  CMa_H3K4me3       1-2      2998
>
> consdata = dba.count(consdata, minOverlap=1)
>
> After I call dba.count I am prompted out R:
>
> terminate called after throwing an instance of 'std::out_of_range'
>   what():  basic_string::compare
> Abort
>
> Where am I wrong?
> Cheers,
> Paolo
>
>
>
>
>
>
> 2012/8/8 Rory Stark <Rory.Stark at cancer.org.uk>:
>> Hello Paolo-
>>
>> To examine the overlaps between peak callers, you need to be working with
>> the peak (occupancy) data BEFORE you settle on a global peakset used for
>> counting and subsequent analysis. In your example below, you have this in
>> "test" after calling "dba" but before calling "dba.count".
>>
>> So after
>>
>>> test=dba(sampleSheet="database.csv")
>>
>> you can check clustering based on occupancy/peak calling:
>>
>>> dba,plotHeatmap(test)
>>> dba.plotPCA(test)
>>
>>
>> You can also plot Venn diagrams of each of the peak caller overlaps -- eg
>> for the H3K27me3 mark, you could see the three sample overlaps as follows:
>>
>>> par(mfrow=c(1,3))
>>> dba.plotVenn(test, test$masks$ES  & test$masks$H3K27)
>>> dba.plotVenn(test, test$masks$CMN & test$masks$H3K27)
>>
>>> dba.plotVenn(test, test$masks$CMA & test$masks$H3K27)
>>
>>
>> Note that specifying minOverlap=2 to dba.count doesn't limit the overlap
>> to replicate peak callers. Rather, it creates a single set of peaks that
>> are identified at least twice amongst all the peaksets: either by both
>> peak callers for a single sample, or by the same peak caller in different
>> samples,  by different peak callers in different samples. If you want to
>> use only peaks called by both peak callers for each sample,you can add a
>> separate consensus peakset for eac condition, eg:
>>
>>> test = dba.peakset(test, test$masks$ES  & test$masks$H3K27,
>>>minOvrlap=2, ID="mES_H3K27me3")
>>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K27,
>>>minOverlap=2, ID="CMp_H3K27e3")
>>
>>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K27,
>>>minOverlp=2, ID="CMa_H3K27me3")
>>
>>
>> repeated 9 more times for a total of 12 pairs. You could if you like see
>> the peak overlaps of the three H3K27me3 samples at this point:
>>
>>> dba.plotVenn(test, testmasks$H3K27 & test$masks$Consensus)
>>
>> Once you have added all 12 consensus peaksets, create a new DBA object
>> containing only the consensus peaksets:
>>
>>> consdata = ba(test, mask = test$masks$Consensus)
>>
>> Then, if you want to use all the peaks identified by both peak callers for
>> all the samples:
>>
>>> consdata = dba.count(consdata, minOverlap=1)
>>
>> And continue with the differential analysis as you had done.
>>
>> Cheers-
>> Rory
>>
>>>
>>>Dear All,
>>>I am using package DiffBind,
>>>Due to the fact that I miss real biological or technical replicates, I
>>>am assesing the overlapping rates between different peak callers.
>>>Here is how I set up my database to read in:
>>>
>>>3 Cell lines, 4 Factor, 2 Replicates, for a total of 12 Conditions
>>>
>>>> rm(list=ls())
>>>> setwd("/home/pkunderf/output/DiffBind/ES-CMN-CMA/")
>>>> test=dba(sampleSheet="database.csv")
>>>> test
>>>24 Samples, 19380 sites in matrix (40809 total):
>>>                  ID Tissue Factor    Condition Replicate Intervals
>>>1  mES_H3K27me3_m     ES  H3K27 mES_H3K27me3         1      1834
>>>2  CMp_H3K27me3_m    CMN  H3K27 CMp_H3K27me3         1      2450
>>>3  CMa_H3K27me3_m    CMA  H3K27 CMa_H3K27me3         1       990
>>>4  mES_H3K27me3_s     ES  H3K27 mES_H3K27me3         2      2188
>>>5  CMp_H3K27me3_s    CN  H3K27 CMp_H3K27me3         2      3388
>>>6  CMa_H3K27me3_s    CMA  H3K27 CMa_H3K27me3         2      5946
>>>7   mES_H3K4me3_m     ES   H3K4  mES_H3K4me3         1     10243
>>>8   CMp_H3K4me3_m    CMN   H3K4  CMp_H3K4me3         1     12476
>>>9   CMa_H3K4me3_m    CMA   H3K4  CMa_H3K4me3         1      5632
>>>10   mES_H3K4m3_s     ES   H3K4  mES_H3K4me3         2     14917
>>>11  CMp_H3K4me3_s    CMN   H3K4  CMp_H3K4me3         2     10646
>>>12  CMa_H3K4me3_s    CMA   H3K4  CMa_H3K4me3         2      5985
>>>13  mES_H3K9me3_m     ES   H3K9  mES_H3K9me3         1       110
>>>14  CMp_H3K9me3_m    CMN   H3K9  CMp_H3K9me3         1       484
>>>15  CMa_H3K9me3_m    CMA   H3K9  CMa_H3K9me3         1       938
>>>16  mES_H3K9me3_s     ES   H3K9  mES_H3K9me3         2       569
>>>17  CMp_H3K9me3_s    CMN   H3K9  CMp_H3K9me3         2       808
>>>18  CMa_H3K9me3_s    CMA   H3K9  CMa_H3K9me3         2      3747
>>>19 mES_H3K79me2_m     ES  H3K79 mES_H3K79me2         1     21318
>>>20 CMp_H3K79me2_m    CMN  H3K79 CMp_H3K79me2         1     13210
>>>21 CMa_H3K79me2_m    CMA  H3K79 CMa_H3K79me2         1      2988
>>>22 mES_H3K79me2_s     ES  H3K79 mES_H3K79me2         2     22164
>>>23 CMp_H3K79me2_s    CMN  H3K79 CMp_H3K79me2         2     13429
>>>24 CMa_H3K79me2_s    CMA  H3K79 CMa_H3K79me2         2      6063
>>>
>>>
>>>> test=dba.count(test)
>>>
>>>> test=dba.contrast(test, categories=DBA_CONDITION,minMembers=2)
>>>#minMemebers was se to 2 due to
>>>the fact I have 2 replicates for each condition
>>>#....78 Contrasts:
>>>
>>>> test = dba.analyze(test)
>>>
>>>#for each condition generate a report
>>>>test.DB1=dba.report(test,contrast=1)
>>>
>>>
>>>I would like now to assess overlapping rates between replicates, and
>>>generate a venn diagram.
>>>but it seems like replicates are merged together and I cannot treat
>>>them as single samples.
>>>I noticed this behaviour also when i try to generate both a PCA plot
>>>and a heatmap plot, replicates are merged, so
>>>what's the utility to display both in the PCA or heatmap plot?
>>>would it be simple to display the original replicates to assess how do
>>>they cluster?
>>>
>>>>pdf('PCA_plot_DBA_CONDITION.pdf')
>>>>dba.plotPCA(test,attributes=DBA_CONDITION,
>>>>vColors=c("grey0","grey23","grey47","antiquewhite4","antiquewhite3","anti
>>>>quewhite1","dodgerblue4","dodgerblue","darkslategray1","brown4","brown","
>>>>coral"))
>>>>dev.off()
>>>
>>>>pdf('HeatMap.pdf')
>>>>corvals=dba.plotHeatmap(test)
>>>>dev.off()
>>>
>>>>pdf('HeatMap_RPKM_FOLD.pdf')
>>>>corvals=dba.plotHeatmap(test,score=DBA_SCORE_RPKM_FOLD)
>>>>dev.off()
>>>
>>>> sessionInfo()
>>>R version 2.15.1 (2012-06-22)
>>>Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>>locale:
>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=C                 LC_NAME=C
>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>[11] C_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>>attached base packages:
>>>[1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>[8] base
>>>
>>>other attached packages:
>>>[1] DiffBind_1.2.0      Biobase_2.16.0      GenomicRanges_1.8.9
>>>[4] IRanges_1.14.4      BiocGenerics_0.2.0
>>>
>>>loaded via a namespace (and not attached):
>>> [1] amap_0.8-7         edgeR_2.6.10       gdata_2.11.0
>>>gplots_2.11.0
>>> [5] gtools_2.7.0       limma_3.12.1       RColorBrewer_1.0-5
>>>stats4_2.15.1
>>> [9] tools_2.15.1       zlibbioc_1.2.0
>>>
>>>
>>>
>>>Any sugestion is appreciated,
>>>cheers,
>>>paolo
>>>
>>>