[Bioc-devel] [BioC] readGappedAlignments and param argument

Tue Sep 10 12:25:33 CEST 2013

On 09/10/2013 12:22 AM, Elena Grassi wrote:
> Ok, thanks - I misread the manual, I just thought that was a way to
> filter out reads falling over certain regions.
> After reading the alignments I use summarizeOverlaps to obtain counts
> with the desired options. I decided to use scanBamParam(which=..)

You can use summarizeOverlaps with a 'BamFileList' created by something like

   myFiles = dir("/some/dir", pattern="bam$")
   bfl = BamFileList(myFiles, yieldSize=1000000)
   olaps = summarizeOverlaps(features, bfl)

see the example on the help page

   method?"summarizeOverlaps,GRanges,BamFileList"

(there's helpful tab completion after 'method?"sum<tab>'). If you've also said

   library(parallel)
   options(mc.cores=detectCores())  ## how many cores to use?

then the bam files are processed in parallel (choose yieldSize so that the 
different threads are not competing for memory).

> because loading bams and counting overlaps is using a lot of RAM for
> big bams and I found online various suggestions to use the

more generally a strategy is to use yieldSize and iterate through the file, like

   bf = BamFile(fl, yieldsSize=1000000)
   open(bf)
   repeat {
       curr = readBamGappedAlignments(bf) ## read up to yieldSize records
       if (length(curr) == 0)
           break ## done
       ## process chunk 'curr'
   }
   close(bf)

this is actually what summarizeOverlaps does when invoked on a BamFile (or 
BamFileList().

> ScanBamParam to load one chr at a time...but as long as I would like
> to be general I decided to extract all the seqlevels in the GRanges
> object that I'm using and then I would like to iterate over those
> extracting the reads that fall on different seqlevels.
> I noticed that the counts where different with this approach with
> respect to doing all the work together. I guess that I have to find
> another way to split my alignments...but I'd really like to avoid
> something like:
> http://comments.gmane.org/gmane.comp.lang.r.sequencing/755

one small thing that came out of that thread was that

   as(seqinfo(BamFile("/some/file")), "GRanges")

gives a GRanges of all the sequence lengths.

> Reducing my features would not work as long as I need to keep the
> original number of the reads...I guess I will have to "flatten" all
> the regions falling on a single chr of my Granges in a way that avoid
> overlapping reads, that should do the trick (I hope...)...maybe I
> could extract the smaller-bigger coords for a given chr and obtain
> chr-based Granges from there.

>
> Thank you very much for your help,
> E.
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793