[BioC] Subsampling of BAM/SAM alignments

Julian Gehring julian.gehring at embl.de
Sun Mar 17 13:31:37 CET 2013


Hi Martin,

Thanks, that is very useful.  For large-scale subsampling, the command 
line tools may still be the way to go, but for many use cases the 
BamSampler will be sufficient.

Best wishes
Julian


On 03/13/2013 09:26 PM, Martin Morgan wrote:
> On 03/12/2013 07:45 PM, Martin Morgan wrote:
>> On 03/12/2013 01:32 PM, Julian Gehring wrote:
>>> Hi,
>>>
>>> Is there an efficient function to randomly subsample alignments from
>>> a BAM/SAM
>>> within R?  In the best case, an equivalent to 'FastqSampler' in
>>> 'ShortRead'.  If
>>> not, what would be a clever way of doing this with the bioc
>>> framework, without
>>> having to read the whole file first?
>>
>> Hi Julian -- there isn't anything at the moment; below is a function
>> that will
>> have at most yieldSize * 2 elements in memory at one time. All the
>> elements in
>> the bam file (satisfying ScanBamParam) will eventually pass through
>> memory, so
>> not super efficient.
>
> I took a different path in Rsamtools 1.11.23; one can create a
> BamSampler instance and use it like BamFile, e.g., in
> readGappedAlignments / scanBam / ...
>
>  > example(BamSampler)
>
> BmSmpl> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>
> BmSmpl> samp <- BamSampler(fl, yieldSize=1000)
>
> BmSmpl> ## two independent samples
> BmSmpl> head(readBamGappedAlignments(samp))
> ...
> BmSmpl> head(readBamGappedAlignments(samp))
> ...
>
> Still not super-efficient.
>
> Martin
>
>>
>> sampleBam <-
>>      function(file, index=file, ..., yieldSize, verbose=FALSE,
>>               param=ScanBamParam(what=scanBamWhat()))
>> {
>>      qunlist <- Rsamtools:::.quickUnlist
>>      tot <- sampleSize <- yieldSize
>>      bf <- open(BamFile(file, index, yieldSize=yieldSize))
>>      smpl <- qunlist(scanBam(bf, param=param))
>>
>>      repeat {
>>          yld <- qunlist(scanBam(bf, param=param))
>>          yld_n <- length(yld[[1]])
>>          if (length(yld[[1]]) == 0L)
>>              break
>>          tot <- tot + yld_n
>>          keep <- rbinom(1L, yld_n, yld_n/ tot)
>>          if (verbose)
>>              message(sprintf("sampling %d of %d", keep, tot))
>>          if (keep == 0L)
>>              next
>>
>>          i <- sample(sampleSize, keep)
>>          j <- sample(yld_n, keep)
>>          smpl <- Map(function(x, y, i, j) {
>>              x[i] <- y[j]
>>              x
>>          }, smpl, yld, MoreArgs=list(i=i, j=j))
>>      }
>>
>>      close(bf)
>>      smpl
>> }
>>
>>
>> with
>>
>>    fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>>    param <- ScanBamParam(
>>        flag=scanBamFlag(isUnmappedQuery=FALSE),
>>        what=c("rname", "pos", "cigar", "strand"))
>>
>>    lst <- sampleBam(fl, yieldSize=1000, param=param, verbose=TRUE)
>>
>> The end result could be fed to a more friendly structure
>>
>>    names(lst)[names(lst) == "rname"] = "seqname"
>>    do.call(GappedAlignments, lst)
>>
>> Lightly tested, especially when ScanBamParam() is non-trivial.
>>
>> Martin
>>
>>>
>>> Best wishes
>>> Julian
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
>



More information about the Bioconductor mailing list