[BioC] Subsampling of BAM/SAM alignments
Julian Gehring
julian.gehring at embl.de
Sun Mar 17 13:31:37 CET 2013
Hi Martin,
Thanks, that is very useful. For large-scale subsampling, the command
line tools may still be the way to go, but for many use cases the
BamSampler will be sufficient.
Best wishes
Julian
On 03/13/2013 09:26 PM, Martin Morgan wrote:
> On 03/12/2013 07:45 PM, Martin Morgan wrote:
>> On 03/12/2013 01:32 PM, Julian Gehring wrote:
>>> Hi,
>>>
>>> Is there an efficient function to randomly subsample alignments from
>>> a BAM/SAM
>>> within R? In the best case, an equivalent to 'FastqSampler' in
>>> 'ShortRead'. If
>>> not, what would be a clever way of doing this with the bioc
>>> framework, without
>>> having to read the whole file first?
>>
>> Hi Julian -- there isn't anything at the moment; below is a function
>> that will
>> have at most yieldSize * 2 elements in memory at one time. All the
>> elements in
>> the bam file (satisfying ScanBamParam) will eventually pass through
>> memory, so
>> not super efficient.
>
> I took a different path in Rsamtools 1.11.23; one can create a
> BamSampler instance and use it like BamFile, e.g., in
> readGappedAlignments / scanBam / ...
>
> > example(BamSampler)
>
> BmSmpl> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>
> BmSmpl> samp <- BamSampler(fl, yieldSize=1000)
>
> BmSmpl> ## two independent samples
> BmSmpl> head(readBamGappedAlignments(samp))
> ...
> BmSmpl> head(readBamGappedAlignments(samp))
> ...
>
> Still not super-efficient.
>
> Martin
>
>>
>> sampleBam <-
>> function(file, index=file, ..., yieldSize, verbose=FALSE,
>> param=ScanBamParam(what=scanBamWhat()))
>> {
>> qunlist <- Rsamtools:::.quickUnlist
>> tot <- sampleSize <- yieldSize
>> bf <- open(BamFile(file, index, yieldSize=yieldSize))
>> smpl <- qunlist(scanBam(bf, param=param))
>>
>> repeat {
>> yld <- qunlist(scanBam(bf, param=param))
>> yld_n <- length(yld[[1]])
>> if (length(yld[[1]]) == 0L)
>> break
>> tot <- tot + yld_n
>> keep <- rbinom(1L, yld_n, yld_n/ tot)
>> if (verbose)
>> message(sprintf("sampling %d of %d", keep, tot))
>> if (keep == 0L)
>> next
>>
>> i <- sample(sampleSize, keep)
>> j <- sample(yld_n, keep)
>> smpl <- Map(function(x, y, i, j) {
>> x[i] <- y[j]
>> x
>> }, smpl, yld, MoreArgs=list(i=i, j=j))
>> }
>>
>> close(bf)
>> smpl
>> }
>>
>>
>> with
>>
>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>> param <- ScanBamParam(
>> flag=scanBamFlag(isUnmappedQuery=FALSE),
>> what=c("rname", "pos", "cigar", "strand"))
>>
>> lst <- sampleBam(fl, yieldSize=1000, param=param, verbose=TRUE)
>>
>> The end result could be fed to a more friendly structure
>>
>> names(lst)[names(lst) == "rname"] = "seqname"
>> do.call(GappedAlignments, lst)
>>
>> Lightly tested, especially when ScanBamParam() is non-trivial.
>>
>> Martin
>>
>>>
>>> Best wishes
>>> Julian
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
>
More information about the Bioconductor
mailing list