[Bioc-sig-seq] chipseq infrastructure

Wed Mar 3 13:14:14 CET 2010

Thanks Vincent. This looks good. 
On 2010-03-02, at 11:01 PM, Vincent Carey wrote:

> NB: At the end of the comments I meant to write "pretty paltry", not
> "pretty" ... it isn't.
> 
> On Tue, Mar 2, 2010 at 10:59 PM, Vincent Carey
> <stvjc at channing.harvard.edu> wrote:
>> Here is an example of using Rsamtools to reproduce a little bit of the
>> resources associated with some ChIP-seq data from the paper by R
>> Lister et al (Nature 2009 PMID 19829295)
>> 
>> 1) a BAM file was retrieved from GEO, specifically
>> 
>> GSM466738_UCSD.H1.H3K36me3.UCSD_42B3BAAXX_6.bam
>> 
>> This file had to be sorted and indexed using samtools; I added 'srt' before .bam
>> 
>> 2) the GEO page for the series includes a reference to a browser; i
>> went to chromosome 2
>> 
>> http://www.ncbi.nlm.nih.gov/nuccore/89161199?report=graph&tracks=key:sequence_track;key:graph_track,annots:NA000000149.1|NA000000150.1|NA000000146.1|NA000000148.1|NA000000169.1|NA000000170.1;key:gene_model_track,RNAs:false,CDSs:false,Exons:false,annots:Unnamed;key:feature_track,subkey:misc_RNA,annots:other
>> 
>> and zoomed into a region around 132M + strand, where I saw something
>> that looked like a peak in H3K36me3
>> 
>> 3) Now we can make a BamViews instance based on the GSM466738 file,
>> and focus on any region of interest by setting bamRanges.
>> 
>> library(Rsamtools)
>> zz = BamViews(bamPaths="GSM466738_UCSD.H1.H3K36me3.UCSD_313BFAAXX_6srt.bam")
>> bamRanges(zz) = RangedData(IRanges(132500000,133000000), space="chr2")
>> 
>> 4) The alignment information can be retrieved and coverage data extracted
>> 
>> dd = readBAMasGappedAlignments(zz)
>> cc = coverage(dd[[1]])$chr2
>> ccx = cumsum(cc at lengths)
>> ccy = cc at values
>> 
>> plot(ccx, ccy) # shows the peak
>> 
>> 5) The ease with which detailed information on aligned reads can be
>> extracted using Rsamtools has led me to favor BAM as a 'hub' format
>> for those considering working with Bioconductor for NGS data.  The
>> voluminous resources are left outside of R, and we can bring things in
>> at various stages of refinement.  More examples of efficient
>> reproduction of published NGS results using bioc are desired -- this
>> one is pretty and is offered mainly to start getting a little more
>> concrete on possible roles of Rsamtools in downstream workflows.
>> 
>> On Mon, Mar 1, 2010 at 7:46 PM, Raphael Gottardo
>> <raphaelgottardo at mac.com> wrote:
>>> Hi Michael (and others),
>>> 
>>> I would certainly second that. You guys have develop great tools for low level analysis of next gen data, but higher level analysis are still lagging behind. Though, this is rather normal as the higher level stuff needs the lower level infrastructure.
>>> 
>>> My group has been working on several aspects of chip-seq analysis and to some extend gene regulation.
>>> As noted in one of the email this morning, we are about to submit our PICS software based on a version of this paper http://arxiv.org/abs/0903.3206, which we hope will be published in Biometrics in the near future. For our package we have used some of the infrastructure available in the chip-seq package, and IRanges.
>>> 
>>> One the problem we have faced is data input. In chipseq, one does not need sequence reads. However, when you use ShortReads you automatically get the sequence reads which takes a lot of memory. For some highly sequenced data we have, it has been somewhat of a bottleneck.
>>> So it would be nice to be able to only read the chr/start/strand information. As pointed out by Wolfgang, rsamtools might be the solution, so we will have to see how we can use rsamtools and the classes defined there for chip-seq. This being said we still have a lot of files from non MAQ aligners.
>>> I think Arnaud Droit, who is in my group, has sent an email about this issue already.
>>> 
>>> Besides PICS that will be submitted this week, we have already released a package for motif analyses, rGADEM, which can work on standard Biostrings objects. rGADEM is relatively fast and well adapted for ChIP-seq enriched regions. We also have another package, MotIV for motif validation and identification based which is based on STAMP (with many improved functionalities). MotIV is under review I believe and should be available soon.
>>> 
>>> Anyway, so very soon we will have a complete pipeline from shortread -> enriched regions (PICS) -> motifs (rGADEM) -> validated motifs and motif occurrences (MotIV) -> other BioC packages (e.g. GenomicsFeatures, etc).
>>> 
>>> So at least this will be a start. Of course we are open to suggestions/requests, etc. If any of you guys want more details feel free to drop us an email.
>>> 
>>> Cheers,
>>> 
>>> Raphael
>>> 
>>> On 2010-03-01, at 10:08 AM, Michael Lawrence wrote:
>>> 
>>>> Hey guys,
>>>> 
>>>> I'm wondering if anyone has given any thought to some sort of generic
>>>> framework for chipseq analysis in Bioconductor, based on the IRanges,
>>>> Biostrings, etc infrastructure. chipseq has some nice utilities; could it be
>>>> transformed into some sort of generic chipseq pipeline? Something like how
>>>> the 'affy' package (I think?) allows other packages to provide alternative
>>>> implementations for particular stages. Just having a clean, refined,
>>>> approximately complete set of chipseq-focused utilities would be nice.
>>>> Presumably chipseq could fill that role? I think we now have a good idea of
>>>> the basic steps in chipseq analysis, so it's probably time for such a
>>>> package to emerge.
>>>> 
>>>> Comments?
>>>> 
>>>> Michael
>>>> 
>>>>       [[alternative HTML version deleted]]
>>>> 
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>> 
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>> 
>>