[BioC] DiffBind - overlap between different peak callers

Tue Aug 14 17:12:57 CEST 2012

Hi, Paolo,

If it's crashing in dba.count, it either can't find the bed files of the
reads, or doesn't like their format.  Most likely the latter.  One
unfortunate restriction the release version has is that it expects bed
files to have either exactly 3 or 6 (or more) columns.  If your bed files
have 4 or 5 columns, it'll crash trying to retrieve the (absent) strand
column.  This bug is fixed in my latest development code (not yet
available), but not in the release version.  Check to see if your bed
files have 4 or 5 columns.  Otherwise, maybe there's some corruption in
the data files.  Does it crash quickly, or does it take a while and then
crash?

If it's the 4/5 column problem, let me know and I'll send you a patch.
Otherwise, I can whip up a little script that verifies the format of the
bed files, and send it to you.

Cheers,

 - Gord

On 2012-08-14 15:51, "Paolo Kunderfranco" <paolo.kunderfranco at gmail.com>
wrote:

>Hi,
>Any idea on how to solve this problem? I am stuck just at the end of
>the project,
>Cheers,
>Paolo
>
>
>> After I call dba.count I am prompted out R:
>>
>> terminate called after throwing an instance of 'std::out_of_range'
>>   what():  basic_string::compare
>> Abort
>>
>
>
>2012/8/9 Paolo Kunderfranco <paolo.kunderfranco at gmail.com>:
>> Dear Rory,
>> Thanks again for Your input, I went through and finally obtained
>> consensus peakset  identified by both peak callers for
>> all the samples:
>>
>>
>>
>> rm(list=ls())
>> library("DiffBind")
>> setwd("/home/pkunderf/output/DiffBind/ES-CMN-CMA/")
>>
>> test=dba(sampleSheet="database.csv")
>> #clustering based on occupancy/peak calling
>> pdf('clustering based on occupancy_peak calling.pdf')
>> dba.plotHeatmap(test)
>> dba.plotPCA(test)
>> dev.off()
>>
>> par(mfrow=c(1,3))
>>
>> pdf('VENN_H3K27me3.pdf')
>> dba.plotVenn(test, test$masks$ES  &
>> test$masks$H3K27,label1='mES_m',label2='mES_s')
>> dba.plotVenn(test, test$masks$CMN &
>> test$masks$H3K27,label1='CMp_m',label2='CMp_s')
>> dba.plotVenn(test, test$masks$CMA &
>> test$masks$H3K27,label1='CMa_m',label2='CMa_s')
>> dev.off()
>> pdf('VENN_H3K4me3.pdf')
>> dba.plotVenn(test, test$masks$ES  &
>> test$masks$H3K4,label1='mES_m',label2='mES_s')
>> dba.plotVenn(test, test$masks$CMN &
>> test$masks$H3K4,label1='CMp_m',label2='CMp_s')
>> dba.plotVenn(test, test$masks$CMA &
>> test$masks$H3K4,label1='CMa_m',label2='CMa_s')
>> dev.off()
>> pdf('VENN_H3K9me3.pdf')
>> dba.plotVenn(test, test$masks$ES  &
>> test$masks$H3K9,label1='mES_m',label2='mES_s')
>> dba.plotVenn(test, test$masks$CMN &
>> test$masks$H3K9,label1='CMp_m',label2='CMp_s')
>> dba.plotVenn(test, test$masks$CMA &
>> test$masks$H3K9,label1='CMa_m',label2='CMa_s')
>> dev.off()
>> pdf('VENN_H3K79me2.pdf')
>> dba.plotVenn(test, test$masks$ES  &
>> test$masks$H3K79,label1='mES_m',label2='mES_s')
>> dba.plotVenn(test, test$masks$CMN &
>> test$masks$H3K79,label1='CMp_m',label2='CMp_s')
>> dba.plotVenn(test, test$masks$CMA &
>> test$masks$H3K79,label1='CMa_m',label2='CMa_s')
>> dev.off()
>> test = dba.peakset(test, test$masks$ES  & test$masks$H3K27,
>> minOverlap=2, sampID="mES_H3K27me3")
>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K27,
>> minOverlap=2, sampID="CMp_H3K27me3")
>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K27,
>> minOverlap=2, sampID="CMa_H3K27me3")
>> pdf('Overlap_H3K27me3.pdf')
>> dba.plotVenn(test, test$masks$H3K27 &
>> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
>> dev.off()
>> test = dba.peakset(test, test$masks$ES  & test$masks$H3K9,
>> minOverlap=2, sampID="mES_H3K9me3")
>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K9,
>> minOverlap=2, sampID="CMp_H3K9me3")
>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K9,
>> minOverlap=2, sampID="CMa_H3K9me3")
>> pdf('Overlap_H3K9me3.pdf')
>> dba.plotVenn(test, test$masks$H3K9 &
>> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
>> dev.off()
>> test = dba.peakset(test, test$masks$ES  & test$masks$H3K79,
>> minOverlap=2, sampID="mES_H3K79me2")
>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K79,
>> minOverlap=2, sampID="CMp_H3K79me2")
>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K79,
>> minOverlap=2, sampID="CMa_H3K79me2")
>> pdf('Overlap_H3K79me2.pdf')
>> dba.plotVenn(test, test$masks$H3K79 &
>> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
>> dev.off()
>> test = dba.peakset(test, test$masks$ES  & test$masks$H3K4,
>> minOverlap=2, sampID="mES_H3K4me3")
>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K4,
>> minOverlap=2, sampID="CMp_H3K4me3")
>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K4,
>> minOverlap=2, sampID="CMa_H3K4me3")
>> pdf('Overlap_H3K4me3.pdf')
>> dba.plotVenn(test, test$masks$H3K4 &
>> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
>> dev.off()
>>
>> consdata = dba(test, mask = test$masks$Consensus)
>>> consdata
>> 12 Samples, 9754 sites in matrix (16143 total):
>>              ID Tissue Factor    Condition Replicate Intervals
>> 1  mES_H3K27me3     ES  H3K27 mES_H3K27me3       1-2      1333
>> 2  CMp_H3K27me3    CMN  H3K27 CMp_H3K27me3       1-2       911
>> 3  CMa_H3K27me3    CMA  H3K27 CMa_H3K27me3       1-2       316
>> 4   mES_H3K9me3     ES   H3K9  mES_H3K9me3       1-2        89
>> 5   CMp_H3K9me3    CMN   H3K9  CMp_H3K9me3       1-2        86
>> 6   CMa_H3K9me3    CMA   H3K9  CMa_H3K9me3       1-2       323
>> 7  mES_H3K79me2     ES  H3K79 mES_H3K79me2       1-2      8590
>> 8  CMp_H3K79me2    CMN  H3K79 CMp_H3K79me2       1-2      4641
>> 9  CMa_H3K79me2    CMA  H3K79 CMa_H3K79me2       1-2      1084
>> 10  mES_H3K4me3     ES   H3K4  mES_H3K4me3       1-2     10018
>> 11  CMp_H3K4me3    CMN   H3K4  CMp_H3K4me3       1-2      8748
>> 12  CMa_H3K4me3    CMA   H3K4  CMa_H3K4me3       1-2      2998
>>
>> consdata = dba.count(consdata, minOverlap=1)
>>
>> After I call dba.count I am prompted out R:
>>
>> terminate called after throwing an instance of 'std::out_of_range'
>>   what():  basic_string::compare
>> Abort
>>
>> Where am I wrong?
>> Cheers,
>> Paolo
>>
>>
>>
>>
>>
>>
>> 2012/8/8 Rory Stark <Rory.Stark at cancer.org.uk>:
>>> Hello Paolo-
>>>
>>> To examine the overlaps between peak callers, you need to be working
>>>with
>>> the peak (occupancy) data BEFORE you settle on a global peakset used
>>>for
>>> counting and subsequent analysis. In your example below, you have this
>>>in
>>> "test" after calling "dba" but before calling "dba.count".
>>>
>>> So after
>>>
>>>> test=dba(sampleSheet="database.csv")
>>>
>>> you can check clustering based on occupancy/peak calling:
>>>
>>>> dba,plotHeatmap(test)
>>>> dba.plotPCA(test)
>>>
>>>
>>> You can also plot Venn diagrams of each of the peak caller overlaps --
>>>eg
>>> for the H3K27me3 mark, you could see the three sample overlaps as
>>>follows:
>>>
>>>> par(mfrow=c(1,3))
>>>> dba.plotVenn(test, test$masks$ES  & test$masks$H3K27)
>>>> dba.plotVenn(test, test$masks$CMN & test$masks$H3K27)
>>>
>>>> dba.plotVenn(test, test$masks$CMA & test$masks$H3K27)
>>>
>>>
>>> Note that specifying minOverlap=2 to dba.count doesn't limit the
>>>overlap
>>> to replicate peak callers. Rather, it creates a single set of peaks
>>>that
>>> are identified at least twice amongst all the peaksets: either by both
>>> peak callers for a single sample, or by the same peak caller in
>>>different
>>> samples,  by different peak callers in different samples. If you want
>>>to
>>> use only peaks called by both peak callers for each sample,you can add
>>>a
>>> separate consensus peakset for eac condition, eg:
>>>
>>>> test = dba.peakset(test, test$masks$ES  & test$masks$H3K27,
>>>>minOvrlap=2, ID="mES_H3K27me3")
>>>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K27,
>>>>minOverlap=2, ID="CMp_H3K27e3")
>>>
>>>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K27,
>>>>minOverlp=2, ID="CMa_H3K27me3")
>>>
>>>
>>> repeated 9 more times for a total of 12 pairs. You could if you like
>>>see
>>> the peak overlaps of the three H3K27me3 samples at this point:
>>>
>>>> dba.plotVenn(test, testmasks$H3K27 & test$masks$Consensus)
>>>
>>> Once you have added all 12 consensus peaksets, create a new DBA object
>>> containing only the consensus peaksets:
>>>
>>>> consdata = ba(test, mask = test$masks$Consensus)
>>>
>>> Then, if you want to use all the peaks identified by both peak callers
>>>for
>>> all the samples:
>>>
>>>> consdata = dba.count(consdata, minOverlap=1)
>>>
>>> And continue with the differential analysis as you had done.
>>>
>>> Cheers-
>>> Rory
>>>
>>>>
>>>>Dear All,
>>>>I am using package DiffBind,
>>>>Due to the fact that I miss real biological or technical replicates, I
>>>>am assesing the overlapping rates between different peak callers.
>>>>Here is how I set up my database to read in:
>>>>
>>>>3 Cell lines, 4 Factor, 2 Replicates, for a total of 12 Conditions
>>>>
>>>>> rm(list=ls())
>>>>> setwd("/home/pkunderf/output/DiffBind/ES-CMN-CMA/")
>>>>> test=dba(sampleSheet="database.csv")
>>>>> test
>>>>24 Samples, 19380 sites in matrix (40809 total):
>>>>                  ID Tissue Factor    Condition Replicate Intervals
>>>>1  mES_H3K27me3_m     ES  H3K27 mES_H3K27me3         1      1834
>>>>2  CMp_H3K27me3_m    CMN  H3K27 CMp_H3K27me3         1      2450
>>>>3  CMa_H3K27me3_m    CMA  H3K27 CMa_H3K27me3         1       990
>>>>4  mES_H3K27me3_s     ES  H3K27 mES_H3K27me3         2      2188
>>>>5  CMp_H3K27me3_s    CN  H3K27 CMp_H3K27me3         2      3388
>>>>6  CMa_H3K27me3_s    CMA  H3K27 CMa_H3K27me3         2      5946
>>>>7   mES_H3K4me3_m     ES   H3K4  mES_H3K4me3         1     10243
>>>>8   CMp_H3K4me3_m    CMN   H3K4  CMp_H3K4me3         1     12476
>>>>9   CMa_H3K4me3_m    CMA   H3K4  CMa_H3K4me3         1      5632
>>>>10   mES_H3K4m3_s     ES   H3K4  mES_H3K4me3         2     14917
>>>>11  CMp_H3K4me3_s    CMN   H3K4  CMp_H3K4me3         2     10646
>>>>12  CMa_H3K4me3_s    CMA   H3K4  CMa_H3K4me3         2      5985
>>>>13  mES_H3K9me3_m     ES   H3K9  mES_H3K9me3         1       110
>>>>14  CMp_H3K9me3_m    CMN   H3K9  CMp_H3K9me3         1       484
>>>>15  CMa_H3K9me3_m    CMA   H3K9  CMa_H3K9me3         1       938
>>>>16  mES_H3K9me3_s     ES   H3K9  mES_H3K9me3         2       569
>>>>17  CMp_H3K9me3_s    CMN   H3K9  CMp_H3K9me3         2       808
>>>>18  CMa_H3K9me3_s    CMA   H3K9  CMa_H3K9me3         2      3747
>>>>19 mES_H3K79me2_m     ES  H3K79 mES_H3K79me2         1     21318
>>>>20 CMp_H3K79me2_m    CMN  H3K79 CMp_H3K79me2         1     13210
>>>>21 CMa_H3K79me2_m    CMA  H3K79 CMa_H3K79me2         1      2988
>>>>22 mES_H3K79me2_s     ES  H3K79 mES_H3K79me2         2     22164
>>>>23 CMp_H3K79me2_s    CMN  H3K79 CMp_H3K79me2         2     13429
>>>>24 CMa_H3K79me2_s    CMA  H3K79 CMa_H3K79me2         2      6063
>>>>
>>>>
>>>>> test=dba.count(test)
>>>>
>>>>> test=dba.contrast(test, categories=DBA_CONDITION,minMembers=2)
>>>>#minMemebers was se to 2 due to
>>>>the fact I have 2 replicates for each condition
>>>>#....78 Contrasts:
>>>>
>>>>> test = dba.analyze(test)
>>>>
>>>>#for each condition generate a report
>>>>>test.DB1=dba.report(test,contrast=1)
>>>>
>>>>
>>>>I would like now to assess overlapping rates between replicates, and
>>>>generate a venn diagram.
>>>>but it seems like replicates are merged together and I cannot treat
>>>>them as single samples.
>>>>I noticed this behaviour also when i try to generate both a PCA plot
>>>>and a heatmap plot, replicates are merged, so
>>>>what's the utility to display both in the PCA or heatmap plot?
>>>>would it be simple to display the original replicates to assess how do
>>>>they cluster?
>>>>
>>>>>pdf('PCA_plot_DBA_CONDITION.pdf')
>>>>>dba.plotPCA(test,attributes=DBA_CONDITION,
>>>>>vColors=c("grey0","grey23","grey47","antiquewhite4","antiquewhite3","a
>>>>>nti
>>>>>quewhite1","dodgerblue4","dodgerblue","darkslategray1","brown4","brown
>>>>>","
>>>>>coral"))
>>>>>dev.off()
>>>>
>>>>>pdf('HeatMap.pdf')
>>>>>corvals=dba.plotHeatmap(test)
>>>>>dev.off()
>>>>
>>>>>pdf('HeatMap_RPKM_FOLD.pdf')
>>>>>corvals=dba.plotHeatmap(test,score=DBA_SCORE_RPKM_FOLD)
>>>>>dev.off()
>>>>
>>>>> sessionInfo()
>>>>R version 2.15.1 (2012-06-22)
>>>>Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>>locale:
>>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>> [7] LC_PAPER=C                 LC_NAME=C
>>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>[11] C_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>>attached base packages:
>>>>[1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>>[8] base
>>>>
>>>>other attached packages:
>>>>[1] DiffBind_1.2.0      Biobase_2.16.0      GenomicRanges_1.8.9
>>>>[4] IRanges_1.14.4      BiocGenerics_0.2.0
>>>>
>>>>loaded via a namespace (and not attached):
>>>> [1] amap_0.8-7         edgeR_2.6.10       gdata_2.11.0
>>>>gplots_2.11.0
>>>> [5] gtools_2.7.0       limma_3.12.1       RColorBrewer_1.0-5
>>>>stats4_2.15.1
>>>> [9] tools_2.15.1       zlibbioc_1.2.0
>>>>
>>>>
>>>>
>>>>Any sugestion is appreciated,
>>>>cheers,
>>>>paolo
>>>>
>>>>

NOTICE AND DISCLAIMER
This e-mail (including any attachments) is intended for ...{{dropped:17}}