[BioC] DiffBind - sample sheet for multiple replicates and peak
Rory Stark
Rory.Stark at cancer.org.uk
Fri Sep 14 13:06:09 CEST 2012
Hello António-
Regarding your sample sheet, you should be able to use the same sample ID and the same read files for each sample, and specify the peak caller as MACS and QuEST as in your initial sample sheet. This should load two sets of peaks for each sample, differing only in peak caller name (as DiffBind doesn't recognize the MACS or QuEST strings, it will default to raw, which is the format you are using). Does this work for you? I've attached an example sample sheet using the sample tamoxifen data included with DiffBind.
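For readers without the attachment, here is a minimal sketch of such a sheet built in base R. It is not the attached tamoxifen example: the sample names and all paths are placeholders, and the column names (in particular PeakCaller) are an assumption to check against ?dba for your DiffBind version.

```r
# Hypothetical sample sheet: one row per (replicate, peak caller) pair.
# All paths are placeholders; column names are assumed, check ?dba.
callers <- c("MACS", "QuEST")
samples <- data.frame(
  SampleID   = c("control1", "control2"),
  Tissue     = "Hela",
  Factor     = "TF",
  Condition  = "wt",
  Replicate  = c(1, 2),
  bamReads   = c("reads/control1.bam", "reads/control2.bam"),
  bamControl = c("reads/input1.bam", "reads/input2.bam"),
  stringsAsFactors = FALSE
)

# Repeat every sample once per caller; SampleID and read files stay identical.
sheet <- samples[rep(seq_len(nrow(samples)), each = length(callers)), ]
sheet$PeakCaller <- rep(callers, times = nrow(samples))
sheet$Peaks <- file.path("peaks",
                         paste0(sheet$SampleID, "_", sheet$PeakCaller, ".bed"))
rownames(sheet) <- NULL

# Written to a temporary directory here; use your own location in practice.
sheet_path <- file.path(tempdir(), "samplesheet.csv")
write.csv(sheet, sheet_path, row.names = FALSE, quote = FALSE)
```

The resulting file can then be passed to dba(sampleSheet = sheet_path); the same pattern extends to the treated samples.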
The next issue is deriving "a set of common peaks for the peak callers". You can do this using all the peaks from all the callers for all the replicates by invoking dba.count directly (by default, any peak that is identified at least twice is included; these could come from two peak callers for a single replicate, or from single peak callers for different replicates). You may want to perform an intermediate step of deriving consensus peaksets for each replicate. In the development version of DiffBind, this is easily done with the call:
data = dba.peakset(data, consensus = -DBA_CALLER)
This adds a consensus peakset for each sample (consisting of peaks identified by both peak callers for that replicate). I have added a feature to dba.count to make it easy to use only these peaksets in deriving the master consensus peakset used for counting:
data = dba.count(data, peaks=data$masks$Consensus)
But this won't be in the build for a few more days. In the meantime you have to retrieve the peakset and pass it to dba.count:
consensus = dba(data,data$masks$Consensus) # make new DBA object with only the consensus peaksets
conspeaks = dba.peakset(consensus,bRetrieve=TRUE) # retrieve master consensus peakset
data = dba.count(data,peaks=conspeaks) # pass in master consensus peakset for counting
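For comparison, the default rule mentioned earlier for calling dba.count directly (a peak is kept if it is identified at least twice) can be illustrated with a toy example in base R. This is only a sketch of the counting logic on made-up data, not DiffBind's implementation; the site and peakset names are hypothetical. As far as I know, the threshold corresponds to the minOverlap argument of dba.count.

```r
# Toy membership table (made-up data): rows are merged candidate sites,
# columns are the four peaksets loaded for two replicates and two callers.
called <- matrix(
  c(TRUE,  TRUE,  FALSE, FALSE,   # site1: both callers, replicate 1
    TRUE,  FALSE, FALSE, FALSE,   # site2: a single peakset only
    FALSE, TRUE,  FALSE, TRUE),   # site3: QuEST in both replicates
  nrow = 3, byrow = TRUE,
  dimnames = list(c("site1", "site2", "site3"),
                  c("rep1_MACS", "rep1_QuEST", "rep2_MACS", "rep2_QuEST"))
)

min_overlap <- 2                      # "identified at least twice"
n_peaksets  <- rowSums(called)        # number of peaksets containing each site
kept <- names(n_peaksets)[n_peaksets >= min_overlap]
print(kept)
```

Here site1 is kept because two callers agree within one replicate, and site3 because one caller agrees across replicates; the per-replicate consensus step above is stricter, since it requires both callers to agree within each replicate.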
As Gord said in another response, the reads must be in BED or BAM format (possibly zipped); you can easily convert SAM files to BAM files.
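One way to do the conversion without leaving R is asBam() from the Rsamtools package; samtools on the command line works just as well. The sketch below assumes Rsamtools is installed, and the file names are placeholders.

```r
# Hypothetical SAM files; replace with your own paths.
sam_files <- c("reads/wt1.sam", "reads/wt2.sam")

# asBam() appends ".bam" to the destination, so strip the ".sam" suffix.
bam_prefix <- sub("\\.sam$", "", sam_files)

if (requireNamespace("Rsamtools", quietly = TRUE)) {
  for (i in seq_along(sam_files)) {
    if (file.exists(sam_files[i])) {
      # Converts to BAM, sorts, and (by default) indexes the result.
      Rsamtools::asBam(sam_files[i], bam_prefix[i])
    }
  }
}

# Paths to put in the bamReads / bamControl columns of the sample sheet.
bam_files <- paste0(bam_prefix, ".bam")
```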
Hope this helps!
Cheers-
Rory
On 14/09/2012 11:00, "bioconductor-request at r-project.org" <bioconductor-request at r-project.org> wrote:
------------------------------
Message: 10
Date: Thu, 13 Sep 2012 18:06:05 +0200
From: António Miguel de Jesus Domingues <amjdomingues at gmail.com>
To: bioconductor at r-project.org
Subject: [BioC] DiffBind - sample sheet for multiple replicates and peak callers
Message-ID: <CAPaCvoBu89nEKmnu-2joJcZybqWdHd=ut=b5O4hL10k+axXgNQ at mail.gmail.com>
Content-Type: text/plain
Hi all,
I am trying to use DiffBind to compare peaks called in control vs
condition. I have 2 replicates for each and I've also called peaks using 2
different peak callers (to wit, MACS and QuEST). I've also prepared a sample
data sheet that looks like this:
SampleID  Tissue  Factor  Condition  Replicate  Peak.caller  bamReads  bamControl  Peaks
control   Hela    TF      wt         1          MACS         path      path        path
control   Hela    TF      wt         1          QuEST        path      path        path
control2  Hela    TF      wt         2          MACS         path      path        path
control2  Hela    TF      wt         2          QuEST        path      path        path
(and the same for the conditions)
My plan was to load all the data and then use DiffBind to select a set of
common peaks for the peak callers before proceeding with the analysis.
However, when I load the data (data = dba(sampleSheet="samplesheet.csv"))
the peaks for each caller are not recognized as a different variable. How
can I do that, and is this silly?
I could also derive a set of common peaks independently but it would be
neat to do it all with the same package and that seems to be possible but I
could not find how to do it in the documentation.
Thanks,
António
--
António Miguel de Jesus Domingues, PhD
Neugebauer group
Max Planck Institute of Molecular Cell Biology and Genetics, Dresden
Pfotenhauerstrasse 108
01307 Dresden
Germany
e-mail: domingue at mpi-cbg.de
tel. +49 351 210 2481
The Unbearable Lightness of Molecular Biology
------------------------------
Message: 20
Date: Fri, 14 Sep 2012 11:23:26 +0200
From: Paolo Kunderfranco <paolo.kunderfranco at gmail.com>
To: bioconductor at r-project.org
Subject: Re: [BioC] DiffBind - sample sheet for multiple replicates and peak
Message-ID: <CAGxWFc-bYWWW9mJMR7nAdcofNLJ=ZyM3sssbc2mYNyDhE5+haQ at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Dear Antonio
I think that this problem was resolved in previous messages; have a look at:
https://stat.ethz.ch/pipermail/bioconductor/2012-August/047351.html?
For the complete code you could browse the August mailing list archive.
SampleID should be unique for each sample, and the bam files should
also be unique, so you should make a copy of all your bamReads and
bamControl files.
In my example, I compared 2 different peak callers in 3 cell lines:
SampleID,Tissue,Factor,Condition,Replicate,bamReads,bamControl,Peaks
mES_H3K27me3_m,ES,H3K27,mES_H3K27me3,1,reads/H3K27me3/ES_H3K27me3_m.bed,reads/H3K27me3/ES_input_m.bed,peaks/H3K27me3_ES_M.bed
CMp_H3K27me3_m,CMN,H3K27,CMp_H3K27me3,1,reads/H3K27me3/CMN_H3K27me3_m.bed,reads/H3K27me3/CMN_input_m.bed,peaks/H3K27me3_CMN_M.bed
CMa_H3K27me3_m,CMA,H3K27,CMa_H3K27me3,1,reads/H3K27me3/CMA_H3K27me3_m.bed,reads/H3K27me3/CMA_input_m.bed,peaks/H3K27me3_CMA_M.bed
mES_H3K27me3_s,ES,H3K27,mES_H3K27me3,2,reads/H3K27me3/ES_H3K27me3_s.bed,reads/H3K27me3/ES_input_s.bed,peaks/H3K27me3_ES_S.bed
CMp_H3K27me3_s,CMN,H3K27,CMp_H3K27me3,2,reads/H3K27me3/CMN_H3K27me3_s.bed,reads/H3K27me3/CMN_input_s.bed,peaks/H3K27me3_CMN_S.bed
CMa_H3K27me3_s,CMA,H3K27,CMa_H3K27me3,2,reads/H3K27me3/CMA_H3K27me3_s.bed,reads/H3K27me3/CMA_input_s.bed,peaks/H3K27me3_CMA_S.bed
Like this it should work,
Cheers,
Paolo
------------------------------
Message: 21
Date: Fri, 14 Sep 2012 11:58:09 +0200
From: António Miguel de Jesus Domingues <amjdomingues at gmail.com>
To: bioconductor at r-project.org
Subject: [BioC] DiffBind - error in dba.count
Message-ID: <CAPaCvoC+9hGAXZaTgJ7J3GUKTEmaummn6SjhoXM1B4YGwVR4RQ at mail.gmail.com>
Content-Type: text/plain
Hi again,
I am trying DiffBind and loaded my data that looks like this:
H3K4m3
4 Samples, 13203 sites in matrix (13792 total):
  ID     Tissue Factor  Condition  Peak.caller Replicate Intervals
1 wt1    Hela   H3K4me3 control1   raw         1         14111
2 wt2    Hela   H3K4me3 control2   raw         2         13771
3 treat1 Hela   H3K4me3 condition1 raw         1         14865
4 treat2 Hela   H3K4me3 condition2 raw         2         13393
But I ran into problems trying to calculate the affinity scores with
dba.count:
H3K4m3 = dba.count(H3K4m3)
Error in cond$counts : $ operator is invalid for atomic vectors
In addition: Warning message:
In mclapply(arglist, fn, ..., mc.preschedule = FALSE) :
6 function calls resulted in an error
The peaks are in bed files (chr, start, end, score) and the reads are in
SAM format.
Can anyone help me with this?
Cheers.
António
sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] DiffBind_1.0.9 Biobase_2.14.0
loaded via a namespace (and not attached):
[1] IRanges_1.12.6 RColorBrewer_1.0-5 amap_0.8-7 edgeR_2.4.6
[5] gdata_2.11.0 gplots_2.11.0 gtools_2.7.0 limma_3.10.3
[9] zlibbioc_1.0.1
------------------------------
End of Bioconductor Digest, Vol 115, Issue 14
*********************************************