[Bioc-sig-seq] [BioC] Re(surrecting): Rsamtools countBam labeling
Martin Morgan
mtmorgan at fhcrc.org
Fri Apr 1 05:27:21 CEST 2011
On 03/31/2011 12:49 PM, Cook, Malcolm wrote:
> Martin,
>
> Hmmm, Thanks .... I think I'm getting it....
>
> Following your lead, I can directly re-order cnt in the OriginalOrder as:
>
>> cnt[sort(unlist(split(values(which2), seqnames(which2)))$OriginalOrder,index.return=TRUE)$ix,]
> space start end width file records nucleotides
> 2 seq2 100 1000 901 ex1.bam 1169 41235
> 1 seq1 1000 2000 1001 ex1.bam 612 21549
> 3 seq2 1000 2000 1001 ex1.bam 642 22640
>
>
> ... which can usefully(?) be abstracted as
>
> countBamWhich<- function (file,which,index=file,...) {
> ### wrapper for countBam that reorders its results to aggree with
> ### 'which', a required ScanBamParam. ... params are additional
> ### ScanBamParam options.
> ###
> ### ASSUMES: internal implementation detail of ScanBamParam. c.f.
> ### http://permalink.gmane.org/gmane.science.biology.informatics.conductor/34208)
> param<- ScanBamParam(which=which,...)
> values(which)[['OriginalOrder']]<- 1:length(which)
> CBW = countBam(file,index,param=ScanBamParam(which=which))
> CBW[sort(unlist(split(values(which),
> seqnames(which)))$OriginalOrder,index.return=TRUE)$ix,]
> }
Hi Malcolm -- maybe the return value is a GRanges with counts? It's
probably better to split a simple vector and use that to impose order on
the result, rather than splitting values(). The return value could then
be cbind'ed as needed...
countBam_GRanges <-
function(file, index=file, ..., param=GRanges())
{
if (0L == length(param))
stop("'param' must have length() != 0")
## FIXME: not sure about 'drop' argument to split?
splt <- split(seq_len(length(param)), as.factor(seqnames(param)))
o <- order(unlist(splt, use.names=FALSE))
cnt <- countBam(file, index, ..., param=ScanBamParam(which=param))
df <- as(cnt[o, c("records", "nucleotides")], "DataFrame")
values(param) <- df
param
}
counts <- countBam_GRanges(fl, param=which1)
values(which1) <- cbind(values(which1), values(counts))
Martin
>
> allowing me to write
>
>> countBamWhich(fl, which2)
> space start end width file records nucleotides
> 2 seq2 100 1000 901 ex1.bam 1169 41235
> 1 seq1 1000 2000 1001 ex1.bam 612 21549
> 3 seq2 1000 2000 1001 ex1.bam 642 22640
>
> All in favor?
>
> ~ Malcolm
>
>
>
>
>> -----Original Message-----
>> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
>> Sent: Thursday, March 31, 2011 11:41 AM
>> To: Cook, Malcolm
>> Cc: 'myoung at wehi.EDU.AU'; 'bioconductor at r-project.org';
>> 'Bioc-sig-sequencing at r-project.org'
>> Subject: Re: [BioC] Re(surrecting): [Bioc-sig-seq] Rsamtools
>> countBam labeling
>>
>> On 03/31/2011 08:32 AM, Cook, Malcolm wrote:
>>> Matt et. al.,
>>>
>>> I wonder if a satisfactory resolution to the issue of "the order
>>> changes between the GRanges object and the countBam data.frame
>>>
>>>
>> http://www.mail-archive.com/bioc-sig-sequencing@r-project.org/msg01144
>>> .html
>>>
>>> I am presented with the same issue and poised to tackle it
>> but wondr
>>> if a generic solution emerged from you inquiries& efforts.
>>
>> Hi Malcolm --
>>
>> For a reproducible example,
>>
>> library(Rsamtools)
>> example(countBam)
>> which1<- as(which, "GRanges")
>> ## which2 might be where your data actually starts
>> which2<- which1[c(2,1,3)]
>> values(which2)[["OriginalOrder"]]<- 1:3
>> param<- ScanBamParam(which=which2)
>> cnt<- countBam(fl, param=param)
>>
>> What happens is that ScanBamParam converts its argument to an
>> IRangesList, using split(ranges(which2), seqnames(which2)).
>> So do the same for the values and unlist
>>
>> cntVals<- unlist(split(values(which2), seqnames(which2)))
>>
>> then cbind coerced values
>>
>> cbind(cnt, as.data.frame(cntVals))
>>
>> with
>>
>> > which2
>> GRanges with 3 ranges and 1 elementMetadata value
>> seqnames ranges strand | OriginalOrder
>> <Rle> <IRanges> <Rle> |<integer>
>> [1] seq2 [ 100, 1000] * | 1
>> [2] seq1 [1000, 2000] * | 2
>> [3] seq2 [1000, 2000] * | 3
>>
>> seqlengths
>> seq1 seq2
>> NA NA
>> > cbind(cnt, as.data.frame(cntVals))
>> space start end width file records nucleotides OriginalOrder
>> 1 seq1 1000 2000 1001 ex1.bam 612 21549 2
>> 2 seq2 100 1000 901 ex1.bam 1169 41235 1
>> 3 seq2 1000 2000 1001 ex1.bam 642 22640 3
>>
>> Martin
>>
>>>
>>> Thanks,
>>>
>>> Malcolm Cook Stowers Institute for Medical Research -
>> Bioinformatics
>>> Kansas City, Missouri USA
>>>
>>>
>>> _______________________________________________
>> Bioconductor mailing
>>> list Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
>>> archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>> --
>> Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>> Location: M1-B861
>> Telephone: 206 667-2793
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the Bioc-sig-sequencing
mailing list