[Bioc-sig-seq] Another ScanBamParam suggestion

Fri Sep 30 18:11:12 CEST 2011

On 09/30/2011 07:48 AM, Ivan Gregoretti wrote:
> Following Janet's example, I would also like to propose an upgrade to
> ScanBamParam:
>
> It would be great if we could tell ScanBamPram that we want to load
> only the reads that passed the vendor's quality filter.
>
> In other words, the functionality I am suggesting is analogous to the
> filter in readAligned() from the ShortRead library.

Hi Ivan --

in principle, flag=scanBamFlag(isValidVendorRead=TRUE) will do this; it 
requires that the flag is set in the BAM file.

>
>
> With the new release of Illumina sequencing reagents (version 3) you
> get 200 million reads per lane from the HiSeq 2000. In my view, with
> samples that big becoming popular, any investment in "read in"
> efficiency is a good investment. I would be happy to provide a sample
> BAM for those interested in addressing this suggestion.
>
> It is also my humble opinion that we should start considering
> parallelisation for reading in. I hope that I am not just wishing too
> much.

I'm actually revising the I/O a little at the moment; I'll implement a 
better strategy for reading in the data. When I look at 'top' on my 
system, I see the CPU running at say 50% which implies disk input is the 
bottleneck; probably this is on our system administration end, where the 
large storage required for BAM files doesn't have completely adequate 
performance. This I/O is tricky to guage, because the next time through 
the BAM file input _is_ CPU limited and much faster -- the disk system 
has done some clever buffering. But in real use cases I wouldn't see the 
benefit of that buffering since I wouldn't be revisiting the file.

In terms of parallel throughput it might often be appropriate to 
parallelize at a higher level, e.g., iterating over regions of interest 
(e.g., GRanges defining chromosomes, with the iteration via lapply) and 
a function FUN tasked with input of data in a subset of the GRanges 
followed by processing (e.g., counting overlaps in an RNASeq experiment) 
that typically leads to a large reduction in data size. To parallelize, 
replace lapply with mclapply (currently in the multicore package, but in 
devel in  the 'parallel' package distributed with base R). Use BamFile 
and BamFileList to avoid re-loading the index on each file access. I'm 
not sure that just inputting large amounts of data in parallel and then 
pasting together to operate on as one large object is a real win -- the 
data is big, and the processing is on a single processor (unless it is 
split again in mclapply...).

I'm open to discussion on this...

Martin

>
> Thank you,
>
> Ivan
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793