[BioC] DiffBind - overlap between different peak callers

Paolo Kunderfranco paolo.kunderfranco at gmail.com
Tue Aug 14 17:35:49 CEST 2012


Hi,
Alignment bed file and peak calling files are only 3 columns, chr,
start and end.

I don't think is a problem of my bed files, because the problem do not
arise while I am using all my dataset for the analysis.

>test=dba(sampleSheet="database.csv")
>test
>test=dba.count(test)
>test=dba.contrast(test, categories=DBA_CONDITION,minMembers=2)
>test=dba.analyze(test)

the code works fine, but when I select only the overlapping peaks
between my replicates:

>test = dba.peakset(test, test$masks$CMN & test$masks$H3K4,
>minOverlap=2, sampID="CMp_H3K4me3")
>test = dba.peakset(test, test$masks$CMA & test$masks$H3K4,
>minOverlap=2, sampID="CMa_H3K4me3")

......for all my samples..

and create a new dataset of my overlapping peaks

>consdata = dba(test, mask = test$masks$Consensus)
>consdata
12 Samples, 9754 sites in matrix (16143 total):
>>>              ID Tissue Factor    Condition Replicate Intervals
>>> 1  mES_H3K27me3     ES  H3K27 mES_H3K27me3       1-2      1333
>>> 2  CMp_H3K27me3    CMN  H3K27 CMp_H3K27me3       1-2       911
>>> 3  CMa_H3K27me3    CMA  H3K27 CMa_H3K27me3       1-2       316
>>> 4   mES_H3K9me3     ES   H3K9  mES_H3K9me3       1-2        89
>>> 5   CMp_H3K9me3    CMN   H3K9  CMp_H3K9me3       1-2        86
>>> 6   CMa_H3K9me3    CMA   H3K9  CMa_H3K9me3       1-2       323
>>> 7  mES_H3K79me2     ES  H3K79 mES_H3K79me2       1-2      8590
>>> 8  CMp_H3K79me2    CMN  H3K79 CMp_H3K79me2       1-2      4641
>>> 9  CMa_H3K79me2    CMA  H3K79 CMa_H3K79me2       1-2      1084
>>> 10  mES_H3K4me3     ES   H3K4  mES_H3K4me3       1-2     10018
>>> 11  CMp_H3K4me3    CMN   H3K4  CMp_H3K4me3       1-2      8748
>>> 12  CMa_H3K4me3    CMA   H3K4  CMa_H3K4me3       1-2      2998


>consdata = dba.count(consdata, minOverlap=1)

I am prompted out R immediately to my working directory

> terminate called after throwing an instance of 'std::out_of_range'
> what():  basic_string::compare
> Abort


Paolo




2012/8/14 Gordon Brown <Gordon.Brown at cancer.org.uk>:
> Hi, Paolo,
>
> If it's crashing in dba.count, it either can't find the bed files of the
> reads, or doesn't like their format.  Most likely the latter.  One
> unfortunate restriction the release version has is that it expects bed
> files to have either exactly 3 or 6 (or more) columns.  If your bed files
> have 4 or 5 columns, it'll crash trying to retrieve the (absent) strand
> column.  This bug is fixed in my latest development code (not yet
> available), but not in the release version.  Check to see if your bed
> files have 4 or 5 columns.  Otherwise, maybe there's some corruption in
> the data files.  Does it crash quickly, or does it take a while and then
> crash?
>
> If it's the 4/5 column problem, let me know and I'll send you a patch.
> Otherwise, I can whip up a little script that verifies the format of the
> bed files, and send it to you.
>
> Cheers,
>
>  - Gord
>
>
> On 2012-08-14 15:51, "Paolo Kunderfranco" <paolo.kunderfranco at gmail.com>
> wrote:
>
>>Hi,
>>Any idea on how to solve this problem? I am stuck just at the end of
>>the project,
>>Cheers,
>>Paolo
>>
>>
>>> After I call dba.count I am prompted out R:
>>>
>>> terminate called after throwing an instance of 'std::out_of_range'
>>>   what():  basic_string::compare
>>> Abort
>>>
>>
>>
>>2012/8/9 Paolo Kunderfranco <paolo.kunderfranco at gmail.com>:
>>> Dear Rory,
>>> Thanks again for Your input, I went through and finally obtained
>>> consensus peakset  identified by both peak callers for
>>> all the samples:
>>>
>>>
>>>
>>> rm(list=ls())
>>> library("DiffBind")
>>> setwd("/home/pkunderf/output/DiffBind/ES-CMN-CMA/")
>>>
>>> test=dba(sampleSheet="database.csv")
>>> #clustering based on occupancy/peak calling
>>> pdf('clustering based on occupancy_peak calling.pdf')
>>> dba.plotHeatmap(test)
>>> dba.plotPCA(test)
>>> dev.off()
>>>
>>> par(mfrow=c(1,3))
>>>
>>> pdf('VENN_H3K27me3.pdf')
>>> dba.plotVenn(test, test$masks$ES  &
>>> test$masks$H3K27,label1='mES_m',label2='mES_s')
>>> dba.plotVenn(test, test$masks$CMN &
>>> test$masks$H3K27,label1='CMp_m',label2='CMp_s')
>>> dba.plotVenn(test, test$masks$CMA &
>>> test$masks$H3K27,label1='CMa_m',label2='CMa_s')
>>> dev.off()
>>> pdf('VENN_H3K4me3.pdf')
>>> dba.plotVenn(test, test$masks$ES  &
>>> test$masks$H3K4,label1='mES_m',label2='mES_s')
>>> dba.plotVenn(test, test$masks$CMN &
>>> test$masks$H3K4,label1='CMp_m',label2='CMp_s')
>>> dba.plotVenn(test, test$masks$CMA &
>>> test$masks$H3K4,label1='CMa_m',label2='CMa_s')
>>> dev.off()
>>> pdf('VENN_H3K9me3.pdf')
>>> dba.plotVenn(test, test$masks$ES  &
>>> test$masks$H3K9,label1='mES_m',label2='mES_s')
>>> dba.plotVenn(test, test$masks$CMN &
>>> test$masks$H3K9,label1='CMp_m',label2='CMp_s')
>>> dba.plotVenn(test, test$masks$CMA &
>>> test$masks$H3K9,label1='CMa_m',label2='CMa_s')
>>> dev.off()
>>> pdf('VENN_H3K79me2.pdf')
>>> dba.plotVenn(test, test$masks$ES  &
>>> test$masks$H3K79,label1='mES_m',label2='mES_s')
>>> dba.plotVenn(test, test$masks$CMN &
>>> test$masks$H3K79,label1='CMp_m',label2='CMp_s')
>>> dba.plotVenn(test, test$masks$CMA &
>>> test$masks$H3K79,label1='CMa_m',label2='CMa_s')
>>> dev.off()
>>> test = dba.peakset(test, test$masks$ES  & test$masks$H3K27,
>>> minOverlap=2, sampID="mES_H3K27me3")
>>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K27,
>>> minOverlap=2, sampID="CMp_H3K27me3")
>>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K27,
>>> minOverlap=2, sampID="CMa_H3K27me3")
>>> pdf('Overlap_H3K27me3.pdf')
>>> dba.plotVenn(test, test$masks$H3K27 &
>>> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
>>> dev.off()
>>> test = dba.peakset(test, test$masks$ES  & test$masks$H3K9,
>>> minOverlap=2, sampID="mES_H3K9me3")
>>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K9,
>>> minOverlap=2, sampID="CMp_H3K9me3")
>>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K9,
>>> minOverlap=2, sampID="CMa_H3K9me3")
>>> pdf('Overlap_H3K9me3.pdf')
>>> dba.plotVenn(test, test$masks$H3K9 &
>>> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
>>> dev.off()
>>> test = dba.peakset(test, test$masks$ES  & test$masks$H3K79,
>>> minOverlap=2, sampID="mES_H3K79me2")
>>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K79,
>>> minOverlap=2, sampID="CMp_H3K79me2")
>>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K79,
>>> minOverlap=2, sampID="CMa_H3K79me2")
>>> pdf('Overlap_H3K79me2.pdf')
>>> dba.plotVenn(test, test$masks$H3K79 &
>>> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
>>> dev.off()
>>> test = dba.peakset(test, test$masks$ES  & test$masks$H3K4,
>>> minOverlap=2, sampID="mES_H3K4me3")
>>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K4,
>>> minOverlap=2, sampID="CMp_H3K4me3")
>>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K4,
>>> minOverlap=2, sampID="CMa_H3K4me3")
>>> pdf('Overlap_H3K4me3.pdf')
>>> dba.plotVenn(test, test$masks$H3K4 &
>>> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
>>> dev.off()
>>>
>>> consdata = dba(test, mask = test$masks$Consensus)
>>>> consdata
>>> 12 Samples, 9754 sites in matrix (16143 total):
>>>              ID Tissue Factor    Condition Replicate Intervals
>>> 1  mES_H3K27me3     ES  H3K27 mES_H3K27me3       1-2      1333
>>> 2  CMp_H3K27me3    CMN  H3K27 CMp_H3K27me3       1-2       911
>>> 3  CMa_H3K27me3    CMA  H3K27 CMa_H3K27me3       1-2       316
>>> 4   mES_H3K9me3     ES   H3K9  mES_H3K9me3       1-2        89
>>> 5   CMp_H3K9me3    CMN   H3K9  CMp_H3K9me3       1-2        86
>>> 6   CMa_H3K9me3    CMA   H3K9  CMa_H3K9me3       1-2       323
>>> 7  mES_H3K79me2     ES  H3K79 mES_H3K79me2       1-2      8590
>>> 8  CMp_H3K79me2    CMN  H3K79 CMp_H3K79me2       1-2      4641
>>> 9  CMa_H3K79me2    CMA  H3K79 CMa_H3K79me2       1-2      1084
>>> 10  mES_H3K4me3     ES   H3K4  mES_H3K4me3       1-2     10018
>>> 11  CMp_H3K4me3    CMN   H3K4  CMp_H3K4me3       1-2      8748
>>> 12  CMa_H3K4me3    CMA   H3K4  CMa_H3K4me3       1-2      2998
>>>
>>> consdata = dba.count(consdata, minOverlap=1)
>>>
>>> After I call dba.count I am prompted out R:
>>>
>>> terminate called after throwing an instance of 'std::out_of_range'
>>>   what():  basic_string::compare
>>> Abort
>>>
>>> Where am I wrong?
>>> Cheers,
>>> Paolo
>>>
>>>
>>>
>>>
>>>
>>>
>>> 2012/8/8 Rory Stark <Rory.Stark at cancer.org.uk>:
>>>> Hello Paolo-
>>>>
>>>> To examine the overlaps between peak callers, you need to be working
>>>>with
>>>> the peak (occupancy) data BEFORE you settle on a global peakset used
>>>>for
>>>> counting and subsequent analysis. In your example below, you have this
>>>>in
>>>> "test" after calling "dba" but before calling "dba.count".
>>>>
>>>> So after
>>>>
>>>>> test=dba(sampleSheet="database.csv")
>>>>
>>>> you can check clustering based on occupancy/peak calling:
>>>>
>>>>> dba,plotHeatmap(test)
>>>>> dba.plotPCA(test)
>>>>
>>>>
>>>> You can also plot Venn diagrams of each of the peak caller overlaps --
>>>>eg
>>>> for the H3K27me3 mark, you could see the three sample overlaps as
>>>>follows:
>>>>
>>>>> par(mfrow=c(1,3))
>>>>> dba.plotVenn(test, test$masks$ES  & test$masks$H3K27)
>>>>> dba.plotVenn(test, test$masks$CMN & test$masks$H3K27)
>>>>
>>>>> dba.plotVenn(test, test$masks$CMA & test$masks$H3K27)
>>>>
>>>>
>>>> Note that specifying minOverlap=2 to dba.count doesn't limit the
>>>>overlap
>>>> to replicate peak callers. Rather, it creates a single set of peaks
>>>>that
>>>> are identified at least twice amongst all the peaksets: either by both
>>>> peak callers for a single sample, or by the same peak caller in
>>>>different
>>>> samples,  by different peak callers in different samples. If you want
>>>>to
>>>> use only peaks called by both peak callers for each sample,you can add
>>>>a
>>>> separate consensus peakset for eac condition, eg:
>>>>
>>>>> test = dba.peakset(test, test$masks$ES  & test$masks$H3K27,
>>>>>minOvrlap=2, ID="mES_H3K27me3")
>>>>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K27,
>>>>>minOverlap=2, ID="CMp_H3K27e3")
>>>>
>>>>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K27,
>>>>>minOverlp=2, ID="CMa_H3K27me3")
>>>>
>>>>
>>>> repeated 9 more times for a total of 12 pairs. You could if you like
>>>>see
>>>> the peak overlaps of the three H3K27me3 samples at this point:
>>>>
>>>>> dba.plotVenn(test, testmasks$H3K27 & test$masks$Consensus)
>>>>
>>>> Once you have added all 12 consensus peaksets, create a new DBA object
>>>> containing only the consensus peaksets:
>>>>
>>>>> consdata = ba(test, mask = test$masks$Consensus)
>>>>
>>>> Then, if you want to use all the peaks identified by both peak callers
>>>>for
>>>> all the samples:
>>>>
>>>>> consdata = dba.count(consdata, minOverlap=1)
>>>>
>>>> And continue with the differential analysis as you had done.
>>>>
>>>> Cheers-
>>>> Rory
>>>>
>>>>>
>>>>>Dear All,
>>>>>I am using package DiffBind,
>>>>>Due to the fact that I miss real biological or technical replicates, I
>>>>>am assesing the overlapping rates between different peak callers.
>>>>>Here is how I set up my database to read in:
>>>>>
>>>>>3 Cell lines, 4 Factor, 2 Replicates, for a total of 12 Conditions
>>>>>
>>>>>> rm(list=ls())
>>>>>> setwd("/home/pkunderf/output/DiffBind/ES-CMN-CMA/")
>>>>>> test=dba(sampleSheet="database.csv")
>>>>>> test
>>>>>24 Samples, 19380 sites in matrix (40809 total):
>>>>>                  ID Tissue Factor    Condition Replicate Intervals
>>>>>1  mES_H3K27me3_m     ES  H3K27 mES_H3K27me3         1      1834
>>>>>2  CMp_H3K27me3_m    CMN  H3K27 CMp_H3K27me3         1      2450
>>>>>3  CMa_H3K27me3_m    CMA  H3K27 CMa_H3K27me3         1       990
>>>>>4  mES_H3K27me3_s     ES  H3K27 mES_H3K27me3         2      2188
>>>>>5  CMp_H3K27me3_s    CN  H3K27 CMp_H3K27me3         2      3388
>>>>>6  CMa_H3K27me3_s    CMA  H3K27 CMa_H3K27me3         2      5946
>>>>>7   mES_H3K4me3_m     ES   H3K4  mES_H3K4me3         1     10243
>>>>>8   CMp_H3K4me3_m    CMN   H3K4  CMp_H3K4me3         1     12476
>>>>>9   CMa_H3K4me3_m    CMA   H3K4  CMa_H3K4me3         1      5632
>>>>>10   mES_H3K4m3_s     ES   H3K4  mES_H3K4me3         2     14917
>>>>>11  CMp_H3K4me3_s    CMN   H3K4  CMp_H3K4me3         2     10646
>>>>>12  CMa_H3K4me3_s    CMA   H3K4  CMa_H3K4me3         2      5985
>>>>>13  mES_H3K9me3_m     ES   H3K9  mES_H3K9me3         1       110
>>>>>14  CMp_H3K9me3_m    CMN   H3K9  CMp_H3K9me3         1       484
>>>>>15  CMa_H3K9me3_m    CMA   H3K9  CMa_H3K9me3         1       938
>>>>>16  mES_H3K9me3_s     ES   H3K9  mES_H3K9me3         2       569
>>>>>17  CMp_H3K9me3_s    CMN   H3K9  CMp_H3K9me3         2       808
>>>>>18  CMa_H3K9me3_s    CMA   H3K9  CMa_H3K9me3         2      3747
>>>>>19 mES_H3K79me2_m     ES  H3K79 mES_H3K79me2         1     21318
>>>>>20 CMp_H3K79me2_m    CMN  H3K79 CMp_H3K79me2         1     13210
>>>>>21 CMa_H3K79me2_m    CMA  H3K79 CMa_H3K79me2         1      2988
>>>>>22 mES_H3K79me2_s     ES  H3K79 mES_H3K79me2         2     22164
>>>>>23 CMp_H3K79me2_s    CMN  H3K79 CMp_H3K79me2         2     13429
>>>>>24 CMa_H3K79me2_s    CMA  H3K79 CMa_H3K79me2         2      6063
>>>>>
>>>>>
>>>>>> test=dba.count(test)
>>>>>
>>>>>> test=dba.contrast(test, categories=DBA_CONDITION,minMembers=2)
>>>>>#minMemebers was se to 2 due to
>>>>>the fact I have 2 replicates for each condition
>>>>>#....78 Contrasts:
>>>>>
>>>>>> test = dba.analyze(test)
>>>>>
>>>>>#for each condition generate a report
>>>>>>test.DB1=dba.report(test,contrast=1)
>>>>>
>>>>>
>>>>>I would like now to assess overlapping rates between replicates, and
>>>>>generate a venn diagram.
>>>>>but it seems like replicates are merged together and I cannot treat
>>>>>them as single samples.
>>>>>I noticed this behaviour also when i try to generate both a PCA plot
>>>>>and a heatmap plot, replicates are merged, so
>>>>>what's the utility to display both in the PCA or heatmap plot?
>>>>>would it be simple to display the original replicates to assess how do
>>>>>they cluster?
>>>>>
>>>>>>pdf('PCA_plot_DBA_CONDITION.pdf')
>>>>>>dba.plotPCA(test,attributes=DBA_CONDITION,
>>>>>>vColors=c("grey0","grey23","grey47","antiquewhite4","antiquewhite3","a
>>>>>>nti
>>>>>>quewhite1","dodgerblue4","dodgerblue","darkslategray1","brown4","brown
>>>>>>","
>>>>>>coral"))
>>>>>>dev.off()
>>>>>
>>>>>>pdf('HeatMap.pdf')
>>>>>>corvals=dba.plotHeatmap(test)
>>>>>>dev.off()
>>>>>
>>>>>>pdf('HeatMap_RPKM_FOLD.pdf')
>>>>>>corvals=dba.plotHeatmap(test,score=DBA_SCORE_RPKM_FOLD)
>>>>>>dev.off()
>>>>>
>>>>>> sessionInfo()
>>>>>R version 2.15.1 (2012-06-22)
>>>>>Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>>locale:
>>>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>> [7] LC_PAPER=C                 LC_NAME=C
>>>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>[11] C_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>>attached base packages:
>>>>>[1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>>>[8] base
>>>>>
>>>>>other attached packages:
>>>>>[1] DiffBind_1.2.0      Biobase_2.16.0      GenomicRanges_1.8.9
>>>>>[4] IRanges_1.14.4      BiocGenerics_0.2.0
>>>>>
>>>>>loaded via a namespace (and not attached):
>>>>> [1] amap_0.8-7         edgeR_2.6.10       gdata_2.11.0
>>>>>gplots_2.11.0
>>>>> [5] gtools_2.7.0       limma_3.12.1       RColorBrewer_1.0-5
>>>>>stats4_2.15.1
>>>>> [9] tools_2.15.1       zlibbioc_1.2.0
>>>>>
>>>>>
>>>>>
>>>>>Any sugestion is appreciated,
>>>>>cheers,
>>>>>paolo
>>>>>
>>>>>
>
>
> NOTICE AND DISCLAIMER
> This e-mail (including any attachments) is intended for the above-named person(s). If you are not the intended recipient, notify the sender immediately, delete this email from your system and do not disclose or use for any purpose.
>
> We may monitor all incoming and outgoing emails in line with current legislation. We have taken steps to ensure that this email and attachments are free from any virus, but it remains your responsibility to ensure that viruses do not adversely affect you.
> Cancer Research UK
> Registered charity in England and Wales (1089464), Scotland (SC041666) and the Isle of Man (1103)
> A company limited by guarantee.  Registered company in England and Wales (4325234) and the Isle of Man (5713F).
> Registered Office Address: Angel Building, 407 St John Street, London EC1V 4AD.



More information about the Bioconductor mailing list