[BioC] Subsampling of BAM/SAM alignments
Martin Morgan
mtmorgan at fhcrc.org
Wed Mar 13 21:26:12 CET 2013
On 03/12/2013 07:45 PM, Martin Morgan wrote:
> On 03/12/2013 01:32 PM, Julian Gehring wrote:
>> Hi,
>>
>> Is there an efficient function to randomly subsample alignments from a BAM/SAM
>> within R? In the best case, an equivalent to 'FastqSampler' in 'ShortRead'. If
>> not, what would be a clever way of doing this with the bioc framework, without
>> having to read the whole file first?
>
> Hi Julian -- there isn't anything at the moment; below is a function that will
> have at most yieldSize * 2 elements in memory at one time. All the elements in
> the bam file (satisfying ScanBamParam) will eventually pass through memory, so
> not super efficient.
I took a different path in Rsamtools 1.11.23; one can create a BamSampler
instance and use it like BamFile, e.g., in readGappedAlignments / scanBam / ...
> example(BamSampler)
BmSmpl> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
BmSmpl> samp <- BamSampler(fl, yieldSize=1000)
BmSmpl> ## two independent samples
BmSmpl> head(readBamGappedAlignments(samp))
...
BmSmpl> head(readBamGappedAlignments(samp))
...
Still not super-efficient.
Martin
>
> sampleBam <-
> function(file, index=file, ..., yieldSize, verbose=FALSE,
> param=ScanBamParam(what=scanBamWhat()))
> {
> qunlist <- Rsamtools:::.quickUnlist
> tot <- sampleSize <- yieldSize
> bf <- open(BamFile(file, index, yieldSize=yieldSize))
> smpl <- qunlist(scanBam(bf, param=param))
>
> repeat {
> yld <- qunlist(scanBam(bf, param=param))
> yld_n <- length(yld[[1]])
> if (length(yld[[1]]) == 0L)
> break
> tot <- tot + yld_n
> keep <- rbinom(1L, yld_n, yld_n/ tot)
> if (verbose)
> message(sprintf("sampling %d of %d", keep, tot))
> if (keep == 0L)
> next
>
> i <- sample(sampleSize, keep)
> j <- sample(yld_n, keep)
> smpl <- Map(function(x, y, i, j) {
> x[i] <- y[j]
> x
> }, smpl, yld, MoreArgs=list(i=i, j=j))
> }
>
> close(bf)
> smpl
> }
>
>
> with
>
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
> param <- ScanBamParam(
> flag=scanBamFlag(isUnmappedQuery=FALSE),
> what=c("rname", "pos", "cigar", "strand"))
>
> lst <- sampleBam(fl, yieldSize=1000, param=param, verbose=TRUE)
>
> The end result could be fed to a more friendly structure
>
> names(lst)[names(lst) == "rname"] = "seqname"
> do.call(GappedAlignments, lst)
>
> Lightly tested, especially when ScanBamParam() is non-trivial.
>
> Martin
>
>>
>> Best wishes
>> Julian
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list