[BioC] DiffBind - overlap between different peak callers
Paolo Kunderfranco
paolo.kunderfranco at gmail.com
Tue Aug 14 17:35:49 CEST 2012
Hi,
Alignment bed file and peak calling files are only 3 columns, chr,
start and end.
I don't think is a problem of my bed files, because the problem do not
arise while I am using all my dataset for the analysis.
>test=dba(sampleSheet="database.csv")
>test
>test=dba.count(test)
>test=dba.contrast(test, categories=DBA_CONDITION,minMembers=2)
>test=dba.analyze(test)
the code works fine, but when I select only the overlapping peaks
between my replicates:
>test = dba.peakset(test, test$masks$CMN & test$masks$H3K4,
>minOverlap=2, sampID="CMp_H3K4me3")
>test = dba.peakset(test, test$masks$CMA & test$masks$H3K4,
>minOverlap=2, sampID="CMa_H3K4me3")
......for all my samples..
and create a new dataset of my overlapping peaks
>consdata = dba(test, mask = test$masks$Consensus)
>consdata
12 Samples, 9754 sites in matrix (16143 total):
>>> ID Tissue Factor Condition Replicate Intervals
>>> 1 mES_H3K27me3 ES H3K27 mES_H3K27me3 1-2 1333
>>> 2 CMp_H3K27me3 CMN H3K27 CMp_H3K27me3 1-2 911
>>> 3 CMa_H3K27me3 CMA H3K27 CMa_H3K27me3 1-2 316
>>> 4 mES_H3K9me3 ES H3K9 mES_H3K9me3 1-2 89
>>> 5 CMp_H3K9me3 CMN H3K9 CMp_H3K9me3 1-2 86
>>> 6 CMa_H3K9me3 CMA H3K9 CMa_H3K9me3 1-2 323
>>> 7 mES_H3K79me2 ES H3K79 mES_H3K79me2 1-2 8590
>>> 8 CMp_H3K79me2 CMN H3K79 CMp_H3K79me2 1-2 4641
>>> 9 CMa_H3K79me2 CMA H3K79 CMa_H3K79me2 1-2 1084
>>> 10 mES_H3K4me3 ES H3K4 mES_H3K4me3 1-2 10018
>>> 11 CMp_H3K4me3 CMN H3K4 CMp_H3K4me3 1-2 8748
>>> 12 CMa_H3K4me3 CMA H3K4 CMa_H3K4me3 1-2 2998
>consdata = dba.count(consdata, minOverlap=1)
I am prompted out R immediately to my working directory
> terminate called after throwing an instance of 'std::out_of_range'
> what(): basic_string::compare
> Abort
Paolo
2012/8/14 Gordon Brown <Gordon.Brown at cancer.org.uk>:
> Hi, Paolo,
>
> If it's crashing in dba.count, it either can't find the bed files of the
> reads, or doesn't like their format. Most likely the latter. One
> unfortunate restriction the release version has is that it expects bed
> files to have either exactly 3 or 6 (or more) columns. If your bed files
> have 4 or 5 columns, it'll crash trying to retrieve the (absent) strand
> column. This bug is fixed in my latest development code (not yet
> available), but not in the release version. Check to see if your bed
> files have 4 or 5 columns. Otherwise, maybe there's some corruption in
> the data files. Does it crash quickly, or does it take a while and then
> crash?
>
> If it's the 4/5 column problem, let me know and I'll send you a patch.
> Otherwise, I can whip up a little script that verifies the format of the
> bed files, and send it to you.
>
> Cheers,
>
> - Gord
>
>
> On 2012-08-14 15:51, "Paolo Kunderfranco" <paolo.kunderfranco at gmail.com>
> wrote:
>
>>Hi,
>>Any idea on how to solve this problem? I am stuck just at the end of
>>the project,
>>Cheers,
>>Paolo
>>
>>
>>> After I call dba.count I am prompted out R:
>>>
>>> terminate called after throwing an instance of 'std::out_of_range'
>>> what(): basic_string::compare
>>> Abort
>>>
>>
>>
>>2012/8/9 Paolo Kunderfranco <paolo.kunderfranco at gmail.com>:
>>> Dear Rory,
>>> Thanks again for Your input, I went through and finally obtained
>>> consensus peakset identified by both peak callers for
>>> all the samples:
>>>
>>>
>>>
>>> rm(list=ls())
>>> library("DiffBind")
>>> setwd("/home/pkunderf/output/DiffBind/ES-CMN-CMA/")
>>>
>>> test=dba(sampleSheet="database.csv")
>>> #clustering based on occupancy/peak calling
>>> pdf('clustering based on occupancy_peak calling.pdf')
>>> dba.plotHeatmap(test)
>>> dba.plotPCA(test)
>>> dev.off()
>>>
>>> par(mfrow=c(1,3))
>>>
>>> pdf('VENN_H3K27me3.pdf')
>>> dba.plotVenn(test, test$masks$ES &
>>> test$masks$H3K27,label1='mES_m',label2='mES_s')
>>> dba.plotVenn(test, test$masks$CMN &
>>> test$masks$H3K27,label1='CMp_m',label2='CMp_s')
>>> dba.plotVenn(test, test$masks$CMA &
>>> test$masks$H3K27,label1='CMa_m',label2='CMa_s')
>>> dev.off()
>>> pdf('VENN_H3K4me3.pdf')
>>> dba.plotVenn(test, test$masks$ES &
>>> test$masks$H3K4,label1='mES_m',label2='mES_s')
>>> dba.plotVenn(test, test$masks$CMN &
>>> test$masks$H3K4,label1='CMp_m',label2='CMp_s')
>>> dba.plotVenn(test, test$masks$CMA &
>>> test$masks$H3K4,label1='CMa_m',label2='CMa_s')
>>> dev.off()
>>> pdf('VENN_H3K9me3.pdf')
>>> dba.plotVenn(test, test$masks$ES &
>>> test$masks$H3K9,label1='mES_m',label2='mES_s')
>>> dba.plotVenn(test, test$masks$CMN &
>>> test$masks$H3K9,label1='CMp_m',label2='CMp_s')
>>> dba.plotVenn(test, test$masks$CMA &
>>> test$masks$H3K9,label1='CMa_m',label2='CMa_s')
>>> dev.off()
>>> pdf('VENN_H3K79me2.pdf')
>>> dba.plotVenn(test, test$masks$ES &
>>> test$masks$H3K79,label1='mES_m',label2='mES_s')
>>> dba.plotVenn(test, test$masks$CMN &
>>> test$masks$H3K79,label1='CMp_m',label2='CMp_s')
>>> dba.plotVenn(test, test$masks$CMA &
>>> test$masks$H3K79,label1='CMa_m',label2='CMa_s')
>>> dev.off()
>>> test = dba.peakset(test, test$masks$ES & test$masks$H3K27,
>>> minOverlap=2, sampID="mES_H3K27me3")
>>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K27,
>>> minOverlap=2, sampID="CMp_H3K27me3")
>>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K27,
>>> minOverlap=2, sampID="CMa_H3K27me3")
>>> pdf('Overlap_H3K27me3.pdf')
>>> dba.plotVenn(test, test$masks$H3K27 &
>>> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
>>> dev.off()
>>> test = dba.peakset(test, test$masks$ES & test$masks$H3K9,
>>> minOverlap=2, sampID="mES_H3K9me3")
>>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K9,
>>> minOverlap=2, sampID="CMp_H3K9me3")
>>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K9,
>>> minOverlap=2, sampID="CMa_H3K9me3")
>>> pdf('Overlap_H3K9me3.pdf')
>>> dba.plotVenn(test, test$masks$H3K9 &
>>> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
>>> dev.off()
>>> test = dba.peakset(test, test$masks$ES & test$masks$H3K79,
>>> minOverlap=2, sampID="mES_H3K79me2")
>>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K79,
>>> minOverlap=2, sampID="CMp_H3K79me2")
>>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K79,
>>> minOverlap=2, sampID="CMa_H3K79me2")
>>> pdf('Overlap_H3K79me2.pdf')
>>> dba.plotVenn(test, test$masks$H3K79 &
>>> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
>>> dev.off()
>>> test = dba.peakset(test, test$masks$ES & test$masks$H3K4,
>>> minOverlap=2, sampID="mES_H3K4me3")
>>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K4,
>>> minOverlap=2, sampID="CMp_H3K4me3")
>>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K4,
>>> minOverlap=2, sampID="CMa_H3K4me3")
>>> pdf('Overlap_H3K4me3.pdf')
>>> dba.plotVenn(test, test$masks$H3K4 &
>>> test$masks$Consensus,label1='mES',label2='CMp',label3='CMa')
>>> dev.off()
>>>
>>> consdata = dba(test, mask = test$masks$Consensus)
>>>> consdata
>>> 12 Samples, 9754 sites in matrix (16143 total):
>>> ID Tissue Factor Condition Replicate Intervals
>>> 1 mES_H3K27me3 ES H3K27 mES_H3K27me3 1-2 1333
>>> 2 CMp_H3K27me3 CMN H3K27 CMp_H3K27me3 1-2 911
>>> 3 CMa_H3K27me3 CMA H3K27 CMa_H3K27me3 1-2 316
>>> 4 mES_H3K9me3 ES H3K9 mES_H3K9me3 1-2 89
>>> 5 CMp_H3K9me3 CMN H3K9 CMp_H3K9me3 1-2 86
>>> 6 CMa_H3K9me3 CMA H3K9 CMa_H3K9me3 1-2 323
>>> 7 mES_H3K79me2 ES H3K79 mES_H3K79me2 1-2 8590
>>> 8 CMp_H3K79me2 CMN H3K79 CMp_H3K79me2 1-2 4641
>>> 9 CMa_H3K79me2 CMA H3K79 CMa_H3K79me2 1-2 1084
>>> 10 mES_H3K4me3 ES H3K4 mES_H3K4me3 1-2 10018
>>> 11 CMp_H3K4me3 CMN H3K4 CMp_H3K4me3 1-2 8748
>>> 12 CMa_H3K4me3 CMA H3K4 CMa_H3K4me3 1-2 2998
>>>
>>> consdata = dba.count(consdata, minOverlap=1)
>>>
>>> After I call dba.count I am prompted out R:
>>>
>>> terminate called after throwing an instance of 'std::out_of_range'
>>> what(): basic_string::compare
>>> Abort
>>>
>>> Where am I wrong?
>>> Cheers,
>>> Paolo
>>>
>>>
>>>
>>>
>>>
>>>
>>> 2012/8/8 Rory Stark <Rory.Stark at cancer.org.uk>:
>>>> Hello Paolo-
>>>>
>>>> To examine the overlaps between peak callers, you need to be working
>>>>with
>>>> the peak (occupancy) data BEFORE you settle on a global peakset used
>>>>for
>>>> counting and subsequent analysis. In your example below, you have this
>>>>in
>>>> "test" after calling "dba" but before calling "dba.count".
>>>>
>>>> So after
>>>>
>>>>> test=dba(sampleSheet="database.csv")
>>>>
>>>> you can check clustering based on occupancy/peak calling:
>>>>
>>>>> dba,plotHeatmap(test)
>>>>> dba.plotPCA(test)
>>>>
>>>>
>>>> You can also plot Venn diagrams of each of the peak caller overlaps --
>>>>eg
>>>> for the H3K27me3 mark, you could see the three sample overlaps as
>>>>follows:
>>>>
>>>>> par(mfrow=c(1,3))
>>>>> dba.plotVenn(test, test$masks$ES & test$masks$H3K27)
>>>>> dba.plotVenn(test, test$masks$CMN & test$masks$H3K27)
>>>>
>>>>> dba.plotVenn(test, test$masks$CMA & test$masks$H3K27)
>>>>
>>>>
>>>> Note that specifying minOverlap=2 to dba.count doesn't limit the
>>>>overlap
>>>> to replicate peak callers. Rather, it creates a single set of peaks
>>>>that
>>>> are identified at least twice amongst all the peaksets: either by both
>>>> peak callers for a single sample, or by the same peak caller in
>>>>different
>>>> samples, by different peak callers in different samples. If you want
>>>>to
>>>> use only peaks called by both peak callers for each sample,you can add
>>>>a
>>>> separate consensus peakset for eac condition, eg:
>>>>
>>>>> test = dba.peakset(test, test$masks$ES & test$masks$H3K27,
>>>>>minOvrlap=2, ID="mES_H3K27me3")
>>>>> test = dba.peakset(test, test$masks$CMN & test$masks$H3K27,
>>>>>minOverlap=2, ID="CMp_H3K27e3")
>>>>
>>>>> test = dba.peakset(test, test$masks$CMA & test$masks$H3K27,
>>>>>minOverlp=2, ID="CMa_H3K27me3")
>>>>
>>>>
>>>> repeated 9 more times for a total of 12 pairs. You could if you like
>>>>see
>>>> the peak overlaps of the three H3K27me3 samples at this point:
>>>>
>>>>> dba.plotVenn(test, testmasks$H3K27 & test$masks$Consensus)
>>>>
>>>> Once you have added all 12 consensus peaksets, create a new DBA object
>>>> containing only the consensus peaksets:
>>>>
>>>>> consdata = ba(test, mask = test$masks$Consensus)
>>>>
>>>> Then, if you want to use all the peaks identified by both peak callers
>>>>for
>>>> all the samples:
>>>>
>>>>> consdata = dba.count(consdata, minOverlap=1)
>>>>
>>>> And continue with the differential analysis as you had done.
>>>>
>>>> Cheers-
>>>> Rory
>>>>
>>>>>
>>>>>Dear All,
>>>>>I am using package DiffBind,
>>>>>Due to the fact that I miss real biological or technical replicates, I
>>>>>am assesing the overlapping rates between different peak callers.
>>>>>Here is how I set up my database to read in:
>>>>>
>>>>>3 Cell lines, 4 Factor, 2 Replicates, for a total of 12 Conditions
>>>>>
>>>>>> rm(list=ls())
>>>>>> setwd("/home/pkunderf/output/DiffBind/ES-CMN-CMA/")
>>>>>> test=dba(sampleSheet="database.csv")
>>>>>> test
>>>>>24 Samples, 19380 sites in matrix (40809 total):
>>>>> ID Tissue Factor Condition Replicate Intervals
>>>>>1 mES_H3K27me3_m ES H3K27 mES_H3K27me3 1 1834
>>>>>2 CMp_H3K27me3_m CMN H3K27 CMp_H3K27me3 1 2450
>>>>>3 CMa_H3K27me3_m CMA H3K27 CMa_H3K27me3 1 990
>>>>>4 mES_H3K27me3_s ES H3K27 mES_H3K27me3 2 2188
>>>>>5 CMp_H3K27me3_s CN H3K27 CMp_H3K27me3 2 3388
>>>>>6 CMa_H3K27me3_s CMA H3K27 CMa_H3K27me3 2 5946
>>>>>7 mES_H3K4me3_m ES H3K4 mES_H3K4me3 1 10243
>>>>>8 CMp_H3K4me3_m CMN H3K4 CMp_H3K4me3 1 12476
>>>>>9 CMa_H3K4me3_m CMA H3K4 CMa_H3K4me3 1 5632
>>>>>10 mES_H3K4m3_s ES H3K4 mES_H3K4me3 2 14917
>>>>>11 CMp_H3K4me3_s CMN H3K4 CMp_H3K4me3 2 10646
>>>>>12 CMa_H3K4me3_s CMA H3K4 CMa_H3K4me3 2 5985
>>>>>13 mES_H3K9me3_m ES H3K9 mES_H3K9me3 1 110
>>>>>14 CMp_H3K9me3_m CMN H3K9 CMp_H3K9me3 1 484
>>>>>15 CMa_H3K9me3_m CMA H3K9 CMa_H3K9me3 1 938
>>>>>16 mES_H3K9me3_s ES H3K9 mES_H3K9me3 2 569
>>>>>17 CMp_H3K9me3_s CMN H3K9 CMp_H3K9me3 2 808
>>>>>18 CMa_H3K9me3_s CMA H3K9 CMa_H3K9me3 2 3747
>>>>>19 mES_H3K79me2_m ES H3K79 mES_H3K79me2 1 21318
>>>>>20 CMp_H3K79me2_m CMN H3K79 CMp_H3K79me2 1 13210
>>>>>21 CMa_H3K79me2_m CMA H3K79 CMa_H3K79me2 1 2988
>>>>>22 mES_H3K79me2_s ES H3K79 mES_H3K79me2 2 22164
>>>>>23 CMp_H3K79me2_s CMN H3K79 CMp_H3K79me2 2 13429
>>>>>24 CMa_H3K79me2_s CMA H3K79 CMa_H3K79me2 2 6063
>>>>>
>>>>>
>>>>>> test=dba.count(test)
>>>>>
>>>>>> test=dba.contrast(test, categories=DBA_CONDITION,minMembers=2)
>>>>>#minMemebers was se to 2 due to
>>>>>the fact I have 2 replicates for each condition
>>>>>#....78 Contrasts:
>>>>>
>>>>>> test = dba.analyze(test)
>>>>>
>>>>>#for each condition generate a report
>>>>>>test.DB1=dba.report(test,contrast=1)
>>>>>
>>>>>
>>>>>I would like now to assess overlapping rates between replicates, and
>>>>>generate a venn diagram.
>>>>>but it seems like replicates are merged together and I cannot treat
>>>>>them as single samples.
>>>>>I noticed this behaviour also when i try to generate both a PCA plot
>>>>>and a heatmap plot, replicates are merged, so
>>>>>what's the utility to display both in the PCA or heatmap plot?
>>>>>would it be simple to display the original replicates to assess how do
>>>>>they cluster?
>>>>>
>>>>>>pdf('PCA_plot_DBA_CONDITION.pdf')
>>>>>>dba.plotPCA(test,attributes=DBA_CONDITION,
>>>>>>vColors=c("grey0","grey23","grey47","antiquewhite4","antiquewhite3","a
>>>>>>nti
>>>>>>quewhite1","dodgerblue4","dodgerblue","darkslategray1","brown4","brown
>>>>>>","
>>>>>>coral"))
>>>>>>dev.off()
>>>>>
>>>>>>pdf('HeatMap.pdf')
>>>>>>corvals=dba.plotHeatmap(test)
>>>>>>dev.off()
>>>>>
>>>>>>pdf('HeatMap_RPKM_FOLD.pdf')
>>>>>>corvals=dba.plotHeatmap(test,score=DBA_SCORE_RPKM_FOLD)
>>>>>>dev.off()
>>>>>
>>>>>> sessionInfo()
>>>>>R version 2.15.1 (2012-06-22)
>>>>>Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>>locale:
>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>>> [7] LC_PAPER=C LC_NAME=C
>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>>[11] C_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>>attached base packages:
>>>>>[1] parallel stats graphics grDevices utils datasets methods
>>>>>[8] base
>>>>>
>>>>>other attached packages:
>>>>>[1] DiffBind_1.2.0 Biobase_2.16.0 GenomicRanges_1.8.9
>>>>>[4] IRanges_1.14.4 BiocGenerics_0.2.0
>>>>>
>>>>>loaded via a namespace (and not attached):
>>>>> [1] amap_0.8-7 edgeR_2.6.10 gdata_2.11.0
>>>>>gplots_2.11.0
>>>>> [5] gtools_2.7.0 limma_3.12.1 RColorBrewer_1.0-5
>>>>>stats4_2.15.1
>>>>> [9] tools_2.15.1 zlibbioc_1.2.0
>>>>>
>>>>>
>>>>>
>>>>>Any sugestion is appreciated,
>>>>>cheers,
>>>>>paolo
>>>>>
>>>>>
>
>
> NOTICE AND DISCLAIMER
> This e-mail (including any attachments) is intended for the above-named person(s). If you are not the intended recipient, notify the sender immediately, delete this email from your system and do not disclose or use for any purpose.
>
> We may monitor all incoming and outgoing emails in line with current legislation. We have taken steps to ensure that this email and attachments are free from any virus, but it remains your responsibility to ensure that viruses do not adversely affect you.
> Cancer Research UK
> Registered charity in England and Wales (1089464), Scotland (SC041666) and the Isle of Man (1103)
> A company limited by guarantee. Registered company in England and Wales (4325234) and the Isle of Man (5713F).
> Registered Office Address: Angel Building, 407 St John Street, London EC1V 4AD.
More information about the Bioconductor
mailing list