[Bioc-devel] [BioC] readGappedAlignments and param argument
Martin Morgan
mtmorgan at fhcrc.org
Tue Sep 10 13:05:00 CEST 2013
On 09/10/2013 03:40 AM, Elena Grassi wrote:
>> You can use summarizeOverlaps with a 'BamFileList' created by something like
>>
>> myFiles = dir("/some/dir", pattern="bam$")
>> bfl = BamFileList(myFiles, yieldSize=1000000)
>> olaps = summarizeOverlaps(features, bfl)
>>
>> see the example on the help page
>>
>> method?"summarizeOverlaps,GRanges,BamFileList"
>
> I see, thanks. Right now I was looking for a solution "prior" to
> summarizeOverlaps due to the structure that I've given to my package -
> but I will re-think about it and
> check if in this way it works (I only have 20 unittest or so to
> rewrite, no worries for me :) ).
>
> Right now I've found a way to define the single chromosome GRanges
> that works with "which=..." and I would like to offer flexibility
> for people with small RAM...therefore running the whole analysis on
> one chr at a time seems a reasonable option all the same.
personally, I think it's better to iterate through the file using yieldSize
rather than 'by chromosome' (it would handle very large files, for instance,
where even a single chromosome has too many alignments to fit in memory).
The following function would be my second choice (not well-tested; hopefully a
version will appear in the next GenomicRanges), which divides seqlengths into a
GRangesList where each element contains ranges that add to approximately equal
sized widths. So in a big bam file you could just make more tiles.
Maybe the next stumbling block is collating results across tiles? The basic
strategy will be 'pre-allocate and fill'...
Martin
library(GenomicRanges)
library(Rsamtools)
library(RNAseqData.HNRNPC.bam.chr14)
yy <- seqlengths(BamFile(fl))
yy <- yy[grep("_", names(yy), invert=TRUE)]
tile(yy, 40)
tile <-
function(seqlengths, n)
{
if (is.null(names(seqlengths)))
stop("names(seqlengths) must not be NULL")
clen <- cumsum(as.numeric(seqlengths))
breaks0 <- ceiling(seq(1L, clen[length(clen)], length.out=n + 1L))
breaks <- sort(unique(c(clen, breaks0)))[-1]
grp <- cumsum(c(TRUE, breaks[-length(breaks)] %in% breaks0))
idx <- cumsum(c(TRUE, (breaks %in% clen)[-length(breaks)]))
splt <- split(breaks, names(seqlengths)[idx])
ends <- Map(`-`, splt, c(0, clen[-length(clen)]))
starts <- lapply(ends, function(x) c(0L, x[-length(x)]) + 1L)
grg <- GRanges(rep(names(ends), sapply(ends, length)),
IRanges(unlist(starts, use.names=FALSE),
unlist(ends, use.names=FALSE)),
seqlengths=seqlengths)
reduce(splitAsList(grg, grp))
}
>
>>> http://comments.gmane.org/gmane.comp.lang.r.sequencing/755
>>
>>
>> one small thing that came out of that thread was that
>>
>> as(seqinfo(BamFile("/some/file")), "GRanges")
>>
>> gives a GRanges of all the sequence lengths.
>
> I saw that but I wanted to avoid having to do that for every bam and
> merge the results afterwards - working at the "annotation" level
> seemed more sensible to me.
>
> I guess that I should have asked my boss to have 1 month to scavenge
> around mans and vignettes before starting to write my first
> bioconductor-R package :)
> E.
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioc-devel
mailing list