[BioC] Subsampling of BAM/SAM alignments

Martin Morgan mtmorgan at fhcrc.org
Wed Mar 13 03:45:01 CET 2013

On 03/12/2013 01:32 PM, Julian Gehring wrote:
> Hi,
> Is there an efficient function to randomly subsample alignments from a BAM/SAM
> within R?  In the best case, an equivalent to 'FastqSampler' in 'ShortRead'.  If
> not, what would be a clever way of doing this with the bioc framework, without
> having to read the whole file first?

Hi Julian -- there isn't anything at the moment; below is a function that will 
have at most yieldSize * 2 elements in memory at one time. All the elements in 
the bam file (satisfying ScanBamParam) will eventually pass through memory, so 
not super efficient.

sampleBam <-
     function(file, index=file, ..., yieldSize, verbose=FALSE,
     qunlist <- Rsamtools:::.quickUnlist
     tot <- sampleSize <- yieldSize
     bf <- open(BamFile(file, index, yieldSize=yieldSize))
     smpl <- qunlist(scanBam(bf, param=param))

     repeat {
         yld <- qunlist(scanBam(bf, param=param))
         yld_n <- length(yld[[1]])
         if (length(yld[[1]]) == 0L)
         tot <- tot + yld_n
         keep <- rbinom(1L, yld_n, yld_n/ tot)
         if (verbose)
             message(sprintf("sampling %d of %d", keep, tot))
         if (keep == 0L)

         i <- sample(sampleSize, keep)
         j <- sample(yld_n, keep)
         smpl <- Map(function(x, y, i, j) {
             x[i] <- y[j]
         }, smpl, yld, MoreArgs=list(i=i, j=j))



   fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
   param <- ScanBamParam(
       what=c("rname", "pos", "cigar", "strand"))

   lst <- sampleBam(fl, yieldSize=1000, param=param, verbose=TRUE)

The end result could be fed to a more friendly structure

   names(lst)[names(lst) == "rname"] = "seqname"
   do.call(GappedAlignments, lst)

Lightly tested, especially when ScanBamParam() is non-trivial.


> Best wishes
> Julian
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

More information about the Bioconductor mailing list