[Bioc-sig-seq] Loading large BAM files

Martin Morgan mtmorgan at fhcrc.org
Wed Jul 13 22:57:45 CEST 2011


On 07/13/2011 01:36 PM, Ivan Gregoretti wrote:
> Hi everybody,
>
> As I wait for my large BAM to be read in by scanBam, I can't help but wonder:
>
> Has anybody tried combining scanBam with multicore to load the
> chromosomes in parallel?
>
> That would require
>
> 1) merging the chunks at the end, and
>
> 2) an index on the original BAM file.
>
> Does anybody have any experience to share?

I was wondering how large a file and how long a load time we're talking 
about.

Use of ScanBamParam(what=...) to read in only the fields you need can 
help; see the sketch below.
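
For example, something like this reads one chromosome per core (an 
untested sketch; 'my.bam' is a hypothetical file name, and the BAM must 
be indexed, e.g., with indexBam(), for the 'which' restriction to work):

   library(Rsamtools)
   library(GenomicRanges)
   library(multicore)

   fl = "my.bam"
   ## named vector of chromosome lengths, from the BAM header
   targets = scanBamHeader(fl)[[1]]$targets

   readChrom = function(chr) {
       ## one range spanning the whole chromosome
       rng = GRanges(chr, IRanges(1, targets[[chr]]))
       ## restrict to the fields actually needed downstream
       param = ScanBamParam(which=rng,
                            what=c("rname", "pos", "cigar"))
       scanBam(fl, param=param)[[1]]
   }
   res = mclapply(names(targets), readChrom)
   ## merge, e.g., the 'pos' field across chromosomes
   pos = unlist(lapply(res, "[[", "pos"), use.names=FALSE)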

For some tasks I'd choose a coarser granularity, e.g., parallelizing 
across multiple BAM files, so that the data reduction (from reads down 
to a vector of 10,000's of counts) occurs on each core:

   library(Rsamtools)        # readGappedAlignments for BAM input
   library(GenomicRanges)    # GRanges, countOverlaps
   library(multicore)        # mclapply

   counter = function(fl, genes) {
       aln = readGappedAlignments(fl)    # read one BAM file
       strand(aln) = "*"                 # discard strand before counting
       hits = countOverlaps(aln, genes)  # genes overlapped by each read
       ## count only reads hitting exactly one gene
       countOverlaps(genes, aln[hits == 1])
   }
   simplify2array(mclapply(bamFiles, counter, genes))

One issue I understand people have is that mclapply uses 'serialize()' 
to convert the return value of each function to a raw vector. Raw 
vectors have the same total length limit as any other R vector (2^31 - 1 
elements), and this places a limit on the size of the chunk returned by 
each core. I also believe that exceeding the limit can silently corrupt 
the data (i.e., a bug), though this is second-hand information.
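
One workaround people use is to keep the per-core return value small, 
e.g., by having each worker save its result to disk and return only a 
file name (an untested sketch; expensiveComputation() is a hypothetical 
per-file step):

   library(multicore)

   worker = function(fl) {
       res = expensiveComputation(fl)  # hypothetical; may be very large
       out = tempfile()                # small, serializable return value
       save(res, file=out)
       out
   }
   files = unlist(mclapply(bamFiles, worker))
   ## load the results back in the master process, one at a time
   results = lapply(files, function(f) { load(f); res })

Returning the file name keeps the serialized result tiny regardless of 
how large 'res' is.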

Martin

>
> Thank you,
>
> Ivan
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


