[BioC] filterBam

Tue Jul 16 17:18:17 CEST 2013

On 07/16/2013 06:20 AM, rcaloger wrote:
> Hi Martin,
> thanks again for the quick response.
> I managed to build the filtered bam file:
>
> filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons",
>        indexDestination=FALSE, param=ScanBamParam(which=exons.gr))
>    sortBam("accepted_hits.tp_exons", "accepted_hits.tp_exons.sorted")
>    indexBam("accepted_hits.tp_exons.sorted.bam")

OK, good. I think what happens is that you specify two ranges

    |-----|        |-----|

and then a read overlapping the second range starts before a read overlapping 
the first range

1     x-x
2   x---------------x
    |-----|        |-----|

The bam file filters the first range and writes read 1, then filters the second 
range and writes read 2. The output file is no longer sorted even though the 
input file was sorted. In some ways I'm ok with this requiring that the new BAM 
file needs to be sorted then indexed.

There is a second hazard here, and that is because read 2 overlaps both ranges, 
it is duplicated in the output file. This is probably not at all what is 
desired, and I'll implement a fix.

Martin

>
> Cheers
> raf
>
> On 7/16/13 3:04 PM, Martin Morgan wrote:
>> On 07/16/2013 05:01 AM, rcaloger wrote:
>>> Dear Martin,
>>> many thanks for the information.
>>>
>>> I got the point:
>>> in exons.gr:
>>> seqlevels(exons.gr)
>>>   [1] "chr1"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17"
>>> [10] "chr18" "chr19" "chr2"  "chr3"  "chr4"  "chr5"  "chr6" "chr7" "chr8"
>>> [19] "chr9"  "chrX"
>>>
>>> seqlevels(BamFile("accepted_hits.sorted.bam")
>>>   [1] "chr1"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17"
>>> [10] "chr18" "chr19" "chr2"  "chr3"  "chr4"  "chr5"  "chr6" "chr7" "chr8"
>>> [19] "chr9"  "chrM"  "chrX"  "chrY"
>>>
>>>   "chrM" is missing in exons.gr
>>>
>>> There is any simple way to add to exons.gr also the chrM?
>>
>> Thanks for the file; the missing 'chrM' should not matter, and the ranges look
>> like they are sorted to me. Maybe a workaround is to use the
>> indexDestination=FALSE argument to filterBam, and then sort and index the
>> filtered results
>>
>>   filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons",
>>       indexDestination=FALSE, param=ScanBamParam(which=exons.gr))
>>   sortBam("accepted_hits.tp_exons", "accepted_hits.tp_exons.sorted")
>>   indexBam("accepted_hits.tp_exons.sorted")
>>
>> Martin
>>
>>> you can get the test.Rda at :
>>> https://mail.unito.it/mailftp/uplister/get_file.php?file=ZmlsZXMvRjEvNDAvZDBjMGExMTk0OTkyYWM3YzY4Nzg0MTFlYjc1My80MzA0ZWU0MzA1OGMzYTBhMmQ1NWZjMWZjNDM4M2M5ZQ==
>>>
>>> Many thanks !
>>> Raf
>>>
>>> the file I used to generate exon.gr was ordered by chr and then by start
>>> position
>>>
>>> On 7/16/13 1:25 PM, Martin Morgan wrote:
>>>> On 07/15/2013 11:19 PM, rcaloger wrote:
>>>>> Hi,
>>>>> I am trying to filter a bam file using a set of coordinates.
>>>>> I sorted the bam file using both sortBam or Picard SortSam tools but I always
>>>>> get the same error:
>>>>>
>>>>> filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons",
>>>>> param=ScanBamParam(which=exons.gr))
>>>>>
>>>>> Error in
>>>>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthetic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]],
>>>>>
>>>>>
>>>>> :
>>>>>    failed to build index
>>>>>    file:
>>>>> /data03/calogero/Documents/singapore/transcripts/semi_synthetic_data_set/for_tm_creation/m1/accepted_hits.tp_exons
>>>>>
>>>>>
>>>>>
>>>>> In addition: Warning messages:
>>>>> 1: In
>>>>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthetic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]],
>>>>>
>>>>>
>>>>> :
>>>>>    [bam_index_core] the alignment is not sorted
>>>>> (D44TDFP1_1:1:1102:6940:117318):
>>>>> 85079683 > 84952092 in 1-th chr
>>>>> 2: In
>>>>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthetic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]],
>>>>>
>>>>>
>>>>> :
>>>>>    [bam_index_build2] fail to index the BAM file.
>>>>
>>>> filterBam tries to place the GRanges into correct order by calling
>>>>
>>>>   bf <- BamFile("accepted_hits.sorted.bam")
>>>>   xx <- Rsamtools:::.filterBam_preprocess(bf, param)
>>>>
>>>> after which
>>>>
>>>>   bamWhich(xx)
>>>>
>>>> is supposed to have (a) non-overlapping ranges with (b) seqnames in the same
>>>> order as in the original BAM file and (c) ordered by start position.
>>>>
>>>> I'm supposing that this fails for your exons.gr, and would be happy to
>>>> trouble-shoot if you're willing to share (off-list is fine) the result of
>>>>
>>>>  save(exons.gr, seqlevels(BamFile("accepted_hits.sorted.bam")),
>>>> file="test.Rda")
>>>>
>>>> Martin
>>>>
>>>>>
>>>>>
>>>>> Any suggestion how to handle this problem?
>>>>> cheers
>>>>> Raf
>>>>>
>>>>>   exons.gr
>>>>> GRanges with 3399 ranges and 1 metadata column:
>>>>>           seqnames                 ranges strand   | transcriptID
>>>>>              <Rle>              <IRanges> <Rle> |  <character>
>>>>>       [1]     chr1   [ 4764598,  4766882]      *   | uc007aff.2
>>>>>       [2]     chr1   [ 7110275,  7110696]      *   | uc007agb.1
>>>>>       [3]     chr1   [ 9858548,  9858769]      *   | uc007agw.1
>>>>>       [4]     chr1   [10029961, 10029968]      *   | uc007ahk.1
>>>>>      ...      ...                    ...    ... ... ...
>>>>>      [3397]     chrX [147525139, 147525147]      *   | uc012hqq.1
>>>>>    [3398]     chrX [148034128, 148034929]      *   | uc009upe.1
>>>>>    [3399]     chrX [156357850, 156359017]      *   | uc009uss.2
>>>>>    ---
>>>>>    seqlengths:
>>>>>      chr1 chr10 chr11 chr12 chr13 chr14 ...  chr5  chr6 chr7 chr8 chr9  chrX
>>>>>        NA    NA    NA    NA    NA    NA ...    NA    NA NA NA NA    NA
>>>>>
>>>>> sessionInfo()
>>>>> R version 2.15.3 (2013-03-01)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>   [7] LC_PAPER=C                 LC_NAME=C
>>>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets methods base
>>>>>
>>>>> other attached packages:
>>>>> [1] Rsamtools_1.10.2     Biostrings_2.26.3 GenomicRanges_1.10.7
>>>>> [4] IRanges_1.16.6       BiocGenerics_0.4.0
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] bitops_1.0-5    parallel_2.15.3 stats4_2.15.3 zlibbioc_1.4.0
>>>>>
>>>>> -- ----------------------------------------
>>>>> Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC Centro di
>>>>> Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 0116706457 Fax
>>>>> ++39 0112366457 Mobile ++39 3333827080 email: raffaele.calogero at unito.it
>>>>> raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> ----------------------------------------
>>> Prof. Raffaele A. Calogero
>>> Bioinformatics and Genomics Unit
>>> MBC Centro di Biotecnologie Molecolari
>>> Via Nizza 52, Torino 10126
>>> tel.   ++39 0116706457
>>> Fax    ++39 0112366457
>>> Mobile ++39 3333827080
>>> email:raffaele.calogero at unito.it
>>>         raffaele[dot]calogero[at]gmail[dot]com
>>> www:http://www.bioinformatica.unito.it
>>>
>>
>>
>
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793