[BioC] Advice on reading big BED/BAM and ChIP-seq quality control

Michael Stadler michael.stadler at fmi.ch
Tue May 14 09:06:24 CEST 2013


Dear Hari,

Reading all alignments of a sample (either from a bed or a bam file)
into memory may not be sustainable.

One way to get around that would be to work on streams, i.e. only
loading a chunk of the alignments at a time. The Rsamtools package
provides such functionality for bam files (see ?BamFile and "yield"
therein).

Alternatively, I would like to point out the QuasR package to you
(apologies for the advertisement). We have avoided the memory issue by
traversing the bam files at the C level, and only extracting the
information that is required. The function qQCrepoprt() may fulfill your
requirement 3), and the functions qCount() and qProfile() may be what
you can use to do 4).

QuasR was designed to begin the analysis with reads and also create the
bam files for you; you can however also start with pre-existing bam
files and still use a good part of its functionality (see vignette
section 5.1, "BAM" files).

I hope this helps,
Michael



Reading alignments

On 13.05.2013 15:43, Hari Easwaran wrote:
> Dear Bioc gurus,
> 
> I am a newbie with using R tools for ChIP-seq analyses and seek advice on
> the best way to go about a data set I have.  Following are the file formats
> I have and what I would like to do with them:
> 
> 1) Using Samtools, I created BED files (about 5 Gb) from the BAM files (3-4
> Gb)
> 
> 2) Want to read the BED files (or BAM files) into R.
> 
> 3) Perform quality control plots (like the number of duplicated reads
> across the samples because the nature of ChIP-seq processing is different
> in some of the samples, and so I want to know what bias it introduces).
> 
> 4) Be able to retrieve specific genomic regions for exploration and
> visualization of reads/peaks in the context of genomic annotations (I guess
> to have in a format so that I can play with GenomicRanges).
> 
> 
> I am doing all this in a cluster with fairly good memory capacities (about
> 18G; Or perhaps I think it is 'good' memory). I went through the mailing
> list and found some very useful discussion on reading BED/BAM files:
> 
> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2011-March/001900.html
> 
> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2011-September/002242.html
> 
> I thought BED files will be easy to work with because it already has data
> in a format that I understand (chromosome, start, end, tags). I tried the
> 'import' function from rtracklayer, as suggested in the above link, to read
> the BED file. However, it didn't work as I run out of memory.
> 
>>From the discussions, it seems an alternative is Rsamtools to read BAM
> files.  Before I go about with trying Rsamtools, I would be happy to get
> some advice on whether I am on the right track by using Rsamtools, and if
> any other packages/tools might have in-built functions to achieve what I
> want with the data.
> 
> Thanks for your time.
> 
> 
> Sincerely
> 
> Hari Easwaran
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 

-- 
--------------------------------------------
Michael Stadler, PhD
Head of Computational Biology
Friedrich Miescher Institute
Basel (Switzerland)
Phone : +41 61 697 6492
Fax   : +41 61 697 3976
Mail  : michael.stadler at fmi.ch



More information about the Bioconductor mailing list