[Bioc-sig-seq] Loading large BAM files

Martin Morgan mtmorgan at fhcrc.org
Fri Jul 15 07:18:04 CEST 2011


On 07/14/2011 08:47 AM, Ivan Gregoretti wrote:
> Hello Martin,
>
> I am currently reading in data with scanBam and making use of the
> ScanBamParam function.
>
> My bam data looks like this:
>
>> names(bam1)
> [1] "chr1:1-197195432" "chr2:1-181748087" "chr3:1-159599783"
>
>> class(bam1)
> [1] "list"
>
>> class(bam1$"chr1:1-197195432")
> [1] "list"
>
>> class(bam1$"chr1:1-197195432"$qname)
> [1] "character"
>
> How do we consolidate all three lists (chr1, chr2, chr3) into a single list?
>
> I read the Rsamtools documentation but the examples do not quite apply
> to this case.

These are lists-of-lists, and the somewhat opaque way of getting 
elements out is along the lines of

   unlist(lapply(bam1, "[[", "qname"))

This paradigm seems not to work for XStringSet, for which

   do.call(c, unname(lapply(bam1, "[[", "seq")))

Martin


>
> Thank you,
>
> Ivan
>
>
>
>
>
> On Wed, Jul 13, 2011 at 5:25 PM, Ivan Gregoretti<ivangreg at gmail.com>  wrote:
>> Hi Martin,
>>
>> I will have to play a little bit both with the counter function and
>> with simplify2array. It's not evident for me what they are intended to
>> do.
>>
>> I can answer your question about BAM size. Our samples are all from
>> Illumina HiSeq now. So, we are talking about no less than 100 million
>> good quality tags per lane.
>>
>> In different occasions I have run into the problem of the 2^31 -1
>> limit and I am always thinking about ways to process large amounts of
>> information by chunks and in parallel. With time, sequencers will only
>> output more tags so we will will run into this problems only more
>> often.
>>
>> The suggestion of using ScanBamParam is good, thank you, but at this
>> early stage of sample analysis I need to load the whole BAM. Some
>> sample quality assessments are in order.
>>
>> Thank you,
>>
>> Ivan
>>
>>
>>
>>
>> On Wed, Jul 13, 2011 at 5:04 PM, Martin Morgan<mtmorgan at fhcrc.org>  wrote:
>>> On 07/13/2011 01:57 PM, Martin Morgan wrote:
>>>>
>>>> On 07/13/2011 01:36 PM, Ivan Gregoretti wrote:
>>>>>
>>>>> Hi everybody,
>>>>>
>>>>> As I wait for my large BAM to be read in by scanBAM, I can't help but
>>>>> to wonder:
>>>>>
>>>>> Has anybody tried combining scanBam with multicore to load the
>>>>> chromosomes in parallel?
>>>>>
>>>>> That would require
>>>>>
>>>>> 1) to merge the chunks at the end and
>>>>>
>>>>> 2) the original BAM to be indexed.
>>>>>
>>>>> Does anybody have any experience to share?
>>>>
>>>> Was wondering how large and long we're talking about?
>>>>
>>>> Use of ScanBamParam(what=...) can help.
>>>>
>>>> For some tasks I'd think of a coarser granularity, e.g., in the context
>>>> of multiple bam files so that the data reduction (to a vector of
>>>> 10,000's of counts) occurs on each core.
>>>>
>>>> counter = function(fl, genes) {
>>>> aln = readGappedAlignments(fl)
>>>> strand(aln) = "*"
>>>> hits = countOverlaps(aln, genes)
>>>> countOverlaps(genes, aln[hits==1])
>>>> }
>>>> simplify2array(mclapply(bamFiles, counter, genes))
>>>>
>>>> One issue I understand people have is that mclapply uses 'serialize()'
>>>> to convert the return value of each function to a raw vector. raw
>>>> vectors have the same total length limit as any other vector (2^31 -1
>>>> elements) and this places a limit on the size of chunk returned by each
>>>> core. Also I believe that exceeding the limit can silently corrupt the
>>>> data (i.e., a bug). This is second-hand information.
>>>
>>> Should also have mentioned that casting a GRanges object to RangesList
>>> provides the appropriate list to iterate over chromosomes, and that
>>> ScanBamParam will accept a which of class RangesList.
>>>
>>> Martin
>>>
>>>>
>>>> Martin
>>>>
>>>>>
>>>>> Thank you,
>>>>>
>>>>> Ivan
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-sig-sequencing mailing list
>>>>> Bioc-sig-sequencing at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>
>>>>
>>>
>>>
>>> --
>>> Computational Biology
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>>
>>> Location: M1-B861
>>> Telephone: 206 667-2793
>>>
>>


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



More information about the Bioc-sig-sequencing mailing list