[BioC] DiffBind - overlap between different peak callers
Paolo Kunderfranco
paolo.kunderfranco at gmail.com
Tue Aug 14 16:51:31 CEST 2012
Hi,
Any idea on how to solve this problem? I am stuck just at the end of
the project,
Cheers,
Paolo
> After I call dba.count I am prompted out R:
>
> terminate called after throwing an instance of 'std::out_of_range'
> what(): basic_string::compare
> Abort
>
2012/8/9 Paolo Kunderfranco <paolo.kunderfranco at gmail.com>:
> Dear Rory,
> Thanks again for Your input, I went through and finally obtained
> consensus peakset identified by both peak callers for
> all the samples:
>
>
>
> rm(list=ls())
> library("DiffBind")
> setwd("/home/pkunderf/output/DiffBind/ES-CMN-CMA/")
>
> test=dba(sampleSheet="database.csv")
> #clustering based on occupancy/peak calling
> pdf('clustering based on occupancy_peak calling.pdf')
> dba.plotHeatmap(test)
> dba.plotPCA(test)
> dev.off()
>
> par(mfrow=c(1,3))
>
> pdf('VENN_H3K27me3.pdf')
> dba.plotVenn(test, test$masks$ES &
> test$masks$H3K27,label1='mES_m',label2='mES_s')
> dba.plotVenn(test, test$masks$CMN &
> test$masks$H3K27,label1='CMp_m',label2='CMp_s')
> dba.plotVenn(test, test$masks$CMA &
> test$masks$H3K27,label1='CMa_m',label2='CMa_s')
> dev.off()
> pdf('VENN_H3K4me3.pdf')
> dba.plotVenn(test, test$masks$ES &
> test$masks$H3K4,label1='mES_m',label2='mES_s')
> dba.plotVenn(test, test$masks$CMN &
> test$masks$H3K4,label1='CMp_m',label2='CMp_s')
> dba.plotVenn(test, test$masks$CMA &
> test$masks$H3K4,label1='CMa_m',label2='CMa_s')
> dev.off()
> pdf('VENN_H3K9me3.pdf')
> dba.plotVenn(test, test$masks$ES &
> test$masks$H3K9,label1='mES_m',label2='mES_s')
> dba.plotVenn(test, test$masks$CMN &
> test$masks$H3K9,label1='CMp_m',label2='CMp_s')
> dba.plotVenn(test, test$masks$CMA &
> test$masks$H3K9,label1='CMa_m',label2='CMa_s')
> dev.off()
> pdf('VENN_H3K79me2.pdf')
> dba.plotVenn(test, test$masks$ES &
> test$masks$H3K79,label1='mES_m',label2='mES_s')
> dba.plotVenn(test, test$masks$CMN &
> test$masks$H3K79,label1='CMp_m',label2='CMp_s')
> dba.plotVenn(test, test$masks$CMA &
> test$masks$H3K79,label1='CMa_m',label2='CMa_s')
> dev.off()
> test = dba.peakset(test, test$masks$ES & test$masks$H3K27,
> minOverlap=2, sampID="mES_H3K27me3")
> test = dba.peakset(test, test$masks$CMN & test$masks$H3K27,
> minOverlap=2, sampID="CMp_H3K27me3")
> test = dba.peakset(test, test$masks$CMA & test$masks$H3K27,
> minOverlap=2, sampID="CMa_H3K27me3")
> pdf('Overlap_H3K27me3.pdf')
> dba.plotVenn(test, test$masks$H3K27 &
> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
> dev.off()
> test = dba.peakset(test, test$masks$ES & test$masks$H3K9,
> minOverlap=2, sampID="mES_H3K9me3")
> test = dba.peakset(test, test$masks$CMN & test$masks$H3K9,
> minOverlap=2, sampID="CMp_H3K9me3")
> test = dba.peakset(test, test$masks$CMA & test$masks$H3K9,
> minOverlap=2, sampID="CMa_H3K9me3")
> pdf('Overlap_H3K9me3.pdf')
> dba.plotVenn(test, test$masks$H3K9 &
> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
> dev.off()
> test = dba.peakset(test, test$masks$ES & test$masks$H3K79,
> minOverlap=2, sampID="mES_H3K79me2")
> test = dba.peakset(test, test$masks$CMN & test$masks$H3K79,
> minOverlap=2, sampID="CMp_H3K79me2")
> test = dba.peakset(test, test$masks$CMA & test$masks$H3K79,
> minOverlap=2, sampID="CMa_H3K79me2")
> pdf('Overlap_H3K79me2.pdf')
> dba.plotVenn(test, test$masks$H3K79 &
> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
> dev.off()
> test = dba.peakset(test, test$masks$ES & test$masks$H3K4,
> minOverlap=2, sampID="mES_H3K4me3")
> test = dba.peakset(test, test$masks$CMN & test$masks$H3K4,
> minOverlap=2, sampID="CMp_H3K4me3")
> test = dba.peakset(test, test$masks$CMA & test$masks$H3K4,
> minOverlap=2, sampID="CMa_H3K4me3")
> pdf('Overlap_H3K4me3.pdf')
> dba.plotVenn(test, test$masks$H3K4 &
> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
> dev.off()
>
> consdata = dba(test, mask = test$masks$Consensus)
>> consdata
> 12 Samples, 9754 sites in matrix (16143 total):
> ID Tissue Factor Condition Replicate Intervals
> 1 mES_H3K27me3 ES H3K27 mES_H3K27me3 1-2 1333
> 2 CMp_H3K27me3 CMN H3K27 CMp_H3K27me3 1-2 911
> 3 CMa_H3K27me3 CMA H3K27 CMa_H3K27me3 1-2 316
> 4 mES_H3K9me3 ES H3K9 mES_H3K9me3 1-2 89
> 5 CMp_H3K9me3 CMN H3K9 CMp_H3K9me3 1-2 86
> 6 CMa_H3K9me3 CMA H3K9 CMa_H3K9me3 1-2 323
> 7 mES_H3K79me2 ES H3K79 mES_H3K79me2 1-2 8590
> 8 CMp_H3K79me2 CMN H3K79 CMp_H3K79me2 1-2 4641
> 9 CMa_H3K79me2 CMA H3K79 CMa_H3K79me2 1-2 1084
> 10 mES_H3K4me3 ES H3K4 mES_H3K4me3 1-2 10018
> 11 CMp_H3K4me3 CMN H3K4 CMp_H3K4me3 1-2 8748
> 12 CMa_H3K4me3 CMA H3K4 CMa_H3K4me3 1-2 2998
>
> consdata = dba.count(consdata, minOverlap=1)
>
> After I call dba.count I am prompted out R:
>
> terminate called after throwing an instance of 'std::out_of_range'
> what(): basic_string::compare
> Abort
>
> Where am I wrong?
> Cheers,
> Paolo
>
>
>
>
>
>
> 2012/8/8 Rory Stark <Rory.Stark at cancer.org.uk>:
>> Hello Paolo-
>>
>> To examine the overlaps between peak callers, you need to be working with
>> the peak (occupancy) data BEFORE you settle on a global peakset used for
>> counting and subsequent analysis. In your example below, you have this in
>> "test" after calling "dba" but before calling "dba.count".
>>
>> So after
>>
>>> test=dba(sampleSheet="database.csv")
>>
>> you can check clustering based on occupancy/peak calling:
>>
>>> dba,plotHeatmap(test)
>>> dba.plotPCA(test)
>>
>>
>> You can also plot Venn diagrams of each of the peak caller overlaps -- eg
>> for the H3K27me3 mark, you could see the three sample overlaps as follows:
>>
>>> par(mfrow=c(1,3))
>>> dba.plotVenn(test, test$masks$ES & test$masks$H3K27)
>>> dba.plotVenn(test, test$masks$CMN & test$masks$H3K27)
>>
>>> dba.plotVenn(test, test$masks$CMA & test$masks$H3K27)
>>
>>
>> Note that specifying minOverlap=2 to dba.count doesn't limit the overlap
>> to replicate peak callers. Rather, it creates a single set of peaks that
>> are identified at least twice amongst all the peaksets: either by both
>> peak callers for a single sample, or by the same peak caller in different
>> samples, by different peak callers in different samples. If you want to
>> use only peaks called by both peak callers for each sample,you can add a
>> separate consensus peakset for eac condition, eg:
>>
>>> test = dba.peakset(test, test$masks$ES & test$masks$H3K27,
>>>minOvrlap=2, ID="mES_H3K27me3")
>>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K27,
>>>minOverlap=2, ID="CMp_H3K27e3")
>>
>>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K27,
>>>minOverlp=2, ID="CMa_H3K27me3")
>>
>>
>> repeated 9 more times for a total of 12 pairs. You could if you like see
>> the peak overlaps of the three H3K27me3 samples at this point:
>>
>>> dba.plotVenn(test, testmasks$H3K27 & test$masks$Consensus)
>>
>> Once you have added all 12 consensus peaksets, create a new DBA object
>> containing only the consensus peaksets:
>>
>>> consdata = ba(test, mask = test$masks$Consensus)
>>
>> Then, if you want to use all the peaks identified by both peak callers for
>> all the samples:
>>
>>> consdata = dba.count(consdata, minOverlap=1)
>>
>> And continue with the differential analysis as you had done.
>>
>> Cheers-
>> Rory
>>
>>>
>>>Dear All,
>>>I am using package DiffBind,
>>>Due to the fact that I miss real biological or technical replicates, I
>>>am assesing the overlapping rates between different peak callers.
>>>Here is how I set up my database to read in:
>>>
>>>3 Cell lines, 4 Factor, 2 Replicates, for a total of 12 Conditions
>>>
>>>> rm(list=ls())
>>>> setwd("/home/pkunderf/output/DiffBind/ES-CMN-CMA/")
>>>> test=dba(sampleSheet="database.csv")
>>>> test
>>>24 Samples, 19380 sites in matrix (40809 total):
>>> ID Tissue Factor Condition Replicate Intervals
>>>1 mES_H3K27me3_m ES H3K27 mES_H3K27me3 1 1834
>>>2 CMp_H3K27me3_m CMN H3K27 CMp_H3K27me3 1 2450
>>>3 CMa_H3K27me3_m CMA H3K27 CMa_H3K27me3 1 990
>>>4 mES_H3K27me3_s ES H3K27 mES_H3K27me3 2 2188
>>>5 CMp_H3K27me3_s CN H3K27 CMp_H3K27me3 2 3388
>>>6 CMa_H3K27me3_s CMA H3K27 CMa_H3K27me3 2 5946
>>>7 mES_H3K4me3_m ES H3K4 mES_H3K4me3 1 10243
>>>8 CMp_H3K4me3_m CMN H3K4 CMp_H3K4me3 1 12476
>>>9 CMa_H3K4me3_m CMA H3K4 CMa_H3K4me3 1 5632
>>>10 mES_H3K4m3_s ES H3K4 mES_H3K4me3 2 14917
>>>11 CMp_H3K4me3_s CMN H3K4 CMp_H3K4me3 2 10646
>>>12 CMa_H3K4me3_s CMA H3K4 CMa_H3K4me3 2 5985
>>>13 mES_H3K9me3_m ES H3K9 mES_H3K9me3 1 110
>>>14 CMp_H3K9me3_m CMN H3K9 CMp_H3K9me3 1 484
>>>15 CMa_H3K9me3_m CMA H3K9 CMa_H3K9me3 1 938
>>>16 mES_H3K9me3_s ES H3K9 mES_H3K9me3 2 569
>>>17 CMp_H3K9me3_s CMN H3K9 CMp_H3K9me3 2 808
>>>18 CMa_H3K9me3_s CMA H3K9 CMa_H3K9me3 2 3747
>>>19 mES_H3K79me2_m ES H3K79 mES_H3K79me2 1 21318
>>>20 CMp_H3K79me2_m CMN H3K79 CMp_H3K79me2 1 13210
>>>21 CMa_H3K79me2_m CMA H3K79 CMa_H3K79me2 1 2988
>>>22 mES_H3K79me2_s ES H3K79 mES_H3K79me2 2 22164
>>>23 CMp_H3K79me2_s CMN H3K79 CMp_H3K79me2 2 13429
>>>24 CMa_H3K79me2_s CMA H3K79 CMa_H3K79me2 2 6063
>>>
>>>
>>>> test=dba.count(test)
>>>
>>>> test=dba.contrast(test, categories=DBA_CONDITION,minMembers=2)
>>>#minMemebers was se to 2 due to
>>>the fact I have 2 replicates for each condition
>>>#....78 Contrasts:
>>>
>>>> test = dba.analyze(test)
>>>
>>>#for each condition generate a report
>>>>test.DB1=dba.report(test,contrast=1)
>>>
>>>
>>>I would like now to assess overlapping rates between replicates, and
>>>generate a venn diagram.
>>>but it seems like replicates are merged together and I cannot treat
>>>them as single samples.
>>>I noticed this behaviour also when i try to generate both a PCA plot
>>>and a heatmap plot, replicates are merged, so
>>>what's the utility to display both in the PCA or heatmap plot?
>>>would it be simple to display the original replicates to assess how do
>>>they cluster?
>>>
>>>>pdf('PCA_plot_DBA_CONDITION.pdf')
>>>>dba.plotPCA(test,attributes=DBA_CONDITION,
>>>>vColors=c("grey0","grey23","grey47","antiquewhite4","antiquewhite3","anti
>>>>quewhite1","dodgerblue4","dodgerblue","darkslategray1","brown4","brown","
>>>>coral"))
>>>>dev.off()
>>>
>>>>pdf('HeatMap.pdf')
>>>>corvals=dba.plotHeatmap(test)
>>>>dev.off()
>>>
>>>>pdf('HeatMap_RPKM_FOLD.pdf')
>>>>corvals=dba.plotHeatmap(test,score=DBA_SCORE_RPKM_FOLD)
>>>>dev.off()
>>>
>>>> sessionInfo()
>>>R version 2.15.1 (2012-06-22)
>>>Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>>locale:
>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=C LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>[11] C_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>>attached base packages:
>>>[1] parallel stats graphics grDevices utils datasets methods
>>>[8] base
>>>
>>>other attached packages:
>>>[1] DiffBind_1.2.0 Biobase_2.16.0 GenomicRanges_1.8.9
>>>[4] IRanges_1.14.4 BiocGenerics_0.2.0
>>>
>>>loaded via a namespace (and not attached):
>>> [1] amap_0.8-7 edgeR_2.6.10 gdata_2.11.0
>>>gplots_2.11.0
>>> [5] gtools_2.7.0 limma_3.12.1 RColorBrewer_1.0-5
>>>stats4_2.15.1
>>> [9] tools_2.15.1 zlibbioc_1.2.0
>>>
>>>
>>>
>>>Any sugestion is appreciated,
>>>cheers,
>>>paolo
>>>
>>>
More information about the Bioconductor
mailing list