[Bioc-sig-seq] Loading large BAM files

Martin Morgan mtmorgan at fhcrc.org
Wed Jul 13 23:04:17 CEST 2011


On 07/13/2011 01:57 PM, Martin Morgan wrote:
> On 07/13/2011 01:36 PM, Ivan Gregoretti wrote:
>> Hi everybody,
>>
>> As I wait for my large BAM to be read in by scanBam, I can't help but
>> wonder:
>>
>> Has anybody tried combining scanBam with multicore to load the
>> chromosomes in parallel?
>>
>> That would require
>>
>> 1) merging the chunks at the end and
>>
>> 2) the original BAM being indexed.
>>
>> Does anybody have any experience to share?
>
> How large a file, and how long a wait, are we talking about?
>
> Use of ScanBamParam(what=...) can help, since only the requested fields
> are read.
>
> For some tasks I'd think of a coarser granularity, e.g., parallelizing
> over multiple BAM files rather than over chromosomes, so that the data
> reduction (to a vector of tens of thousands of counts) occurs on each
> core.
>
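> library(GenomicRanges)  # readGappedAlignments, countOverlaps
> library(multicore)      # mclapply
>
> ## count reads that overlap exactly one gene, ignoring strand
> counter = function(fl, genes) {
>     aln = readGappedAlignments(fl)
>     strand(aln) = "*"                   # make overlaps strand-agnostic
>     hits = countOverlaps(aln, genes)    # number of genes hit by each read
>     countOverlaps(genes, aln[hits==1])  # reads per gene, unambiguous only
> }
> ## one vector of per-gene counts per BAM file, combined into an array
> simplify2array(mclapply(bamFiles, counter, genes))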
>
> One issue I understand people run into is that mclapply uses 'serialize()'
> to convert the return value of each function to a raw vector. Raw
> vectors have the same total length limit as any other vector (2^31 - 1
> elements), and this places a limit on the size of the chunk returned by
> each core. I also believe that exceeding the limit can silently corrupt
> the data (i.e., a bug). This is second-hand information.
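
A minimal sketch of how one might check the serialized size of a
representative return value before running the full job. The helper
name serializedSize is made up for illustration (it is not part of
multicore), and the comparison assumes that the 2^31 - 1 limit
described above applies to the R version in use.

```r
# Hypothetical helper (not part of multicore): number of bytes that
# serialize() produces for 'x', i.e. roughly what a worker sends back.
serializedSize <- function(x) length(serialize(x, connection = NULL))

limit <- .Machine$integer.max   # 2^31 - 1, the limit described above
counts <- integer(10000)        # stand-in for one worker's count vector
size <- serializedSize(counts)
size < limit                    # does the serialized result fit?
```

A vector of counts is far below the limit; the check matters mostly
when each worker returns raw alignment data.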

Should also have mentioned that coercing a GRanges object to a RangesList
provides the appropriate list for iterating over chromosomes, and that
ScanBamParam accepts a 'which' argument of class RangesList.
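
A rough sketch of the per-chromosome idea, under several assumptions:
the 'parallel' package stands in for 'multicore', a one-range GRanges
is passed as 'which' instead of a RangesList, and the function name
scanBamByChromosome and its arguments are made up for illustration.
The BAM file needs an index, and mclapply only forks on Unix-like
systems.

```r
library(Rsamtools)      # scanBam, scanBamHeader, ScanBamParam
library(GenomicRanges)  # GRanges (IRanges is attached with it)
library(parallel)       # mclapply; assumption: stands in for 'multicore'

# Hypothetical helper: read the requested fields one chromosome at a
# time. 'bam' is the path to an indexed BAM file.
scanBamByChromosome <- function(bam, what = c("pos", "qwidth"), cores = 2L) {
    # Named integer vector: sequence names and lengths from the header
    targets <- scanBamHeader(bam)[[1]]$targets
    res <- mclapply(names(targets), function(chr) {
        # One range spanning the whole chromosome
        which <- GRanges(chr, IRanges(1L, targets[[chr]]))
        param <- ScanBamParam(which = which, what = what)
        # scanBam returns one list element per range; there is only one
        scanBam(bam, param = param)[[1]]
    }, mc.cores = cores)
    names(res) <- names(targets)
    res  # list of per-chromosome results, still to be merged by the caller
}
```

Restricting 'what' to a few fields keeps each worker's return value
small, which also helps with the serialize() limit mentioned earlier.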

Martin

>
> Martin
>
>>
>> Thank you,
>>
>> Ivan
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


