[Bioc-devel] [BioC] readGappedAlignments and param argument

Tue Sep 10 13:05:00 CEST 2013

On 09/10/2013 03:40 AM, Elena Grassi wrote:
>> You can use summarizeOverlaps with a 'BamFileList' created by something like
>>
>>    myFiles = dir("/some/dir", pattern="bam$")
>>    bfl = BamFileList(myFiles, yieldSize=1000000)
>>    olaps = summarizeOverlaps(features, bfl)
>>
>> see the example on the help page
>>
>>    method?"summarizeOverlaps,GRanges,BamFileList"
>
> I see, thanks. Right now I was looking for a solution "prior" to
> summarizeOverlaps due to the structure that I've given to my package -
> but I will re-think about it and
> check if in this way it works (I only have 20 unittest or so to
> rewrite, no worries for me :) ).
>
> Right now I've found a way to define the single chromosome GRanges
> that works with "which=..." and I would like to offer flexibility
> for people with small RAM...therefore running the whole analysis on
> one chr at a time seems a reasonable option all the same.

personally, I think it's better to iterate through the file using yieldSize 
rather than 'by chromosome' (it would handle very large files, for instance, 
where even a single chromosome has too many alignments to fit in memory).

The following function would be my second choice (not well-tested; hopefully a 
version will appear in the next GenomicRanges), which divides seqlengths into a 
GRangesList where each element contains ranges that add to approximately equal 
sized widths. So in a big bam file you could just make more tiles.

Maybe the next stumbling block is collating results across tiles? The basic 
strategy will be 'pre-allocate and fill'...

Martin

library(GenomicRanges)
library(Rsamtools)
library(RNAseqData.HNRNPC.bam.chr14)
yy <- seqlengths(BamFile(fl))
yy <- yy[grep("_", names(yy), invert=TRUE)]
tile(yy, 40)

tile <-
     function(seqlengths, n)
{
     if (is.null(names(seqlengths)))
         stop("names(seqlengths) must not be NULL")

     clen <- cumsum(as.numeric(seqlengths))
     breaks0 <- ceiling(seq(1L, clen[length(clen)], length.out=n + 1L))
     breaks <- sort(unique(c(clen, breaks0)))[-1]
     grp <- cumsum(c(TRUE, breaks[-length(breaks)] %in% breaks0))

     idx <- cumsum(c(TRUE, (breaks %in% clen)[-length(breaks)]))
     splt <- split(breaks, names(seqlengths)[idx])
     ends <- Map(`-`, splt, c(0, clen[-length(clen)]))
     starts <- lapply(ends, function(x) c(0L, x[-length(x)]) + 1L)

     grg <- GRanges(rep(names(ends), sapply(ends, length)),
                    IRanges(unlist(starts, use.names=FALSE),
                            unlist(ends, use.names=FALSE)),
                    seqlengths=seqlengths)
     reduce(splitAsList(grg, grp))
}

>
>>> http://comments.gmane.org/gmane.comp.lang.r.sequencing/755
>>
>>
>> one small thing that came out of that thread was that
>>
>>    as(seqinfo(BamFile("/some/file")), "GRanges")
>>
>> gives a GRanges of all the sequence lengths.
>
> I saw that but I wanted to avoid having to do that for every bam and
> merge the results afterwards - working at the "annotation" level
> seemed more sensible to me.
>
> I guess that I should have asked my boss to have 1 month to scavenge
> around mans and vignettes before starting to write my first
> bioconductor-R package :)
> E.
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793