[Bioc-sig-seq] Another ScanBamParam suggestion
mtmorgan at fhcrc.org
Fri Sep 30 18:11:12 CEST 2011
On 09/30/2011 07:48 AM, Ivan Gregoretti wrote:
> Following Janet's example, I would also like to propose an upgrade to
> It would be great if we could tell ScanBamPram that we want to load
> only the reads that passed the vendor's quality filter.
> In other words, the functionality I am suggesting is analogous to the
> filter in readAligned() from the ShortRead library.
Hi Ivan --
in principle, flag=scanBamFlag(isValidVendorRead=TRUE) will do this; it
requires that the flag is set in the BAM file.
> With the new release of Illumina sequencing reagents (version 3) you
> get 200 million reads per lane from the HiSeq 2000. In my view, with
> samples that big becoming popular, any investment in "read in"
> efficiency is a good investment. I would be happy to provide a sample
> BAM for those interested in addressing this suggestion.
> It is also my humble opinion that we should start considering
> parallelisation for reading in. I hope that I am not just wishing too
I'm actually revising the I/O a little at the moment; I'll implement a
better strategy for reading in the data. When I look at 'top' on my
system, I see the CPU running at say 50% which implies disk input is the
bottleneck; probably this is on our system administration end, where the
large storage required for BAM files doesn't have completely adequate
performance. This I/O is tricky to guage, because the next time through
the BAM file input _is_ CPU limited and much faster -- the disk system
has done some clever buffering. But in real use cases I wouldn't see the
benefit of that buffering since I wouldn't be revisiting the file.
In terms of parallel throughput it might often be appropriate to
parallelize at a higher level, e.g., iterating over regions of interest
(e.g., GRanges defining chromosomes, with the iteration via lapply) and
a function FUN tasked with input of data in a subset of the GRanges
followed by processing (e.g., counting overlaps in an RNASeq experiment)
that typically leads to a large reduction in data size. To parallelize,
replace lapply with mclapply (currently in the multicore package, but in
devel in the 'parallel' package distributed with base R). Use BamFile
and BamFileList to avoid re-loading the index on each file access. I'm
not sure that just inputting large amounts of data in parallel and then
pasting together to operate on as one large object is a real win -- the
data is big, and the processing is on a single processor (unless it is
split again in mclapply...).
I'm open to discussion on this...
> Thank you,
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Telephone: 206 667-2793
More information about the Bioc-sig-sequencing