[BioC] Subsampling of BAM/SAM alignments

Martin Morgan mtmorgan at fhcrc.org
Wed Mar 13 21:26:12 CET 2013


On 03/12/2013 07:45 PM, Martin Morgan wrote:
> On 03/12/2013 01:32 PM, Julian Gehring wrote:
>> Hi,
>>
>> Is there an efficient function to randomly subsample alignments from a BAM/SAM
>> within R?  In the best case, an equivalent to 'FastqSampler' in 'ShortRead'.  If
>> not, what would be a clever way of doing this with the bioc framework, without
>> having to read the whole file first?
>
> Hi Julian -- there isn't anything at the moment; below is a function that will
> have at most yieldSize * 2 elements in memory at one time. All the elements in
> the bam file (satisfying ScanBamParam) will eventually pass through memory, so
> not super efficient.

I took a different path in Rsamtools 1.11.23; one can create a BamSampler 
instance and use it like BamFile, e.g., in readGappedAlignments / scanBam / ...

 > example(BamSampler)

BmSmpl> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")

BmSmpl> samp <- BamSampler(fl, yieldSize=1000)

BmSmpl> ## two independent samples
BmSmpl> head(readBamGappedAlignments(samp))
...
BmSmpl> head(readBamGappedAlignments(samp))
...

Still not super-efficient.

Martin

>
> sampleBam <-
>      function(file, index=file, ..., yieldSize, verbose=FALSE,
>               param=ScanBamParam(what=scanBamWhat()))
> {
>      qunlist <- Rsamtools:::.quickUnlist
>      tot <- sampleSize <- yieldSize
>      bf <- open(BamFile(file, index, yieldSize=yieldSize))
>      smpl <- qunlist(scanBam(bf, param=param))
>
>      repeat {
>          yld <- qunlist(scanBam(bf, param=param))
>          yld_n <- length(yld[[1]])
>          if (length(yld[[1]]) == 0L)
>              break
>          tot <- tot + yld_n
>          keep <- rbinom(1L, yld_n, yld_n/ tot)
>          if (verbose)
>              message(sprintf("sampling %d of %d", keep, tot))
>          if (keep == 0L)
>              next
>
>          i <- sample(sampleSize, keep)
>          j <- sample(yld_n, keep)
>          smpl <- Map(function(x, y, i, j) {
>              x[i] <- y[j]
>              x
>          }, smpl, yld, MoreArgs=list(i=i, j=j))
>      }
>
>      close(bf)
>      smpl
> }
>
>
> with
>
>    fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>    param <- ScanBamParam(
>        flag=scanBamFlag(isUnmappedQuery=FALSE),
>        what=c("rname", "pos", "cigar", "strand"))
>
>    lst <- sampleBam(fl, yieldSize=1000, param=param, verbose=TRUE)
>
> The end result could be fed to a more friendly structure
>
>    names(lst)[names(lst) == "rname"] = "seqname"
>    do.call(GappedAlignments, lst)
>
> Lightly tested, especially when ScanBamParam() is non-trivial.
>
> Martin
>
>>
>> Best wishes
>> Julian
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list