[BioC] filterBam

Martin Morgan mtmorgan at fhcrc.org
Tue Jul 16 15:04:59 CEST 2013


On 07/16/2013 05:01 AM, rcaloger wrote:
> Dear Martin,
> many thanks for the information.
>
> I got the point:
> in exons.gr:
> seqlevels(exons.gr)
>   [1] "chr1"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17"
> [10] "chr18" "chr19" "chr2"  "chr3"  "chr4"  "chr5"  "chr6" "chr7"  "chr8"
> [19] "chr9"  "chrX"
>
> seqlevels(BamFile("accepted_hits.sorted.bam")
>   [1] "chr1"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17"
> [10] "chr18" "chr19" "chr2"  "chr3"  "chr4"  "chr5"  "chr6" "chr7"  "chr8"
> [19] "chr9"  "chrM"  "chrX"  "chrY"
>
>   "chrM" is missing in exons.gr
>
> There is any simple way to add to exons.gr also the chrM?

Thanks for the file; the missing 'chrM' should not matter, and the ranges look 
like they are sorted to me. Maybe a workaround is to use the 
indexDestination=FALSE argument to filterBam, and then sort and index the 
filtered results

   filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons",
       indexDestination=FALSE, param=ScanBamParam(which=exons.gr))
   sortBam("accepted_hits.tp_exons", "accepted_hits.tp_exons.sorted")
   indexBam("accepted_hits.tp_exons.sorted")

Martin

> you can get the test.Rda at :
> https://mail.unito.it/mailftp/uplister/get_file.php?file=ZmlsZXMvRjEvNDAvZDBjMGExMTk0OTkyYWM3YzY4Nzg0MTFlYjc1My80MzA0ZWU0MzA1OGMzYTBhMmQ1NWZjMWZjNDM4M2M5ZQ==
> Many thanks !
> Raf
>
> the file I used to generate exon.gr was ordered by chr and then by start position
>
> On 7/16/13 1:25 PM, Martin Morgan wrote:
>> On 07/15/2013 11:19 PM, rcaloger wrote:
>>> Hi,
>>> I am trying to filter a bam file using a set of coordinates.
>>> I sorted the bam file using both sortBam or Picard SortSam tools but I always
>>> get the same error:
>>>
>>> filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons",
>>> param=ScanBamParam(which=exons.gr))
>>>
>>> Error in
>>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthetic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]],
>>>
>>> :
>>>    failed to build index
>>>    file:
>>> /data03/calogero/Documents/singapore/transcripts/semi_synthetic_data_set/for_tm_creation/m1/accepted_hits.tp_exons
>>>
>>>
>>> In addition: Warning messages:
>>> 1: In
>>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthetic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]],
>>>
>>> :
>>>    [bam_index_core] the alignment is not sorted (D44TDFP1_1:1:1102:6940:117318):
>>> 85079683 > 84952092 in 1-th chr
>>> 2: In
>>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthetic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]],
>>>
>>> :
>>>    [bam_index_build2] fail to index the BAM file.
>>
>> filterBam tries to place the GRanges into correct order by calling
>>
>>   bf <- BamFile("accepted_hits.sorted.bam")
>>   xx <- Rsamtools:::.filterBam_preprocess(bf, param)
>>
>> after which
>>
>>   bamWhich(xx)
>>
>> is supposed to have (a) non-overlapping ranges with (b) seqnames in the same
>> order as in the original BAM file and (c) ordered by start position.
>>
>> I'm supposing that this fails for your exons.gr, and would be happy to
>> trouble-shoot if you're willing to share (off-list is fine) the result of
>>
>>  save(exons.gr, seqlevels(BamFile("accepted_hits.sorted.bam")), file="test.Rda")
>>
>> Martin
>>
>>>
>>>
>>> Any suggestion how to handle this problem?
>>> cheers
>>> Raf
>>>
>>>   exons.gr
>>> GRanges with 3399 ranges and 1 metadata column:
>>>           seqnames                 ranges strand   | transcriptID
>>>              <Rle>              <IRanges> <Rle> |  <character>
>>>       [1]     chr1   [ 4764598,  4766882]      *   | uc007aff.2
>>>       [2]     chr1   [ 7110275,  7110696]      *   | uc007agb.1
>>>       [3]     chr1   [ 9858548,  9858769]      *   | uc007agw.1
>>>       [4]     chr1   [10029961, 10029968]      *   | uc007ahk.1
>>>      ...      ...                    ...    ... ...          ...
>>>      [3397]     chrX [147525139, 147525147]      *   | uc012hqq.1
>>>    [3398]     chrX [148034128, 148034929]      *   | uc009upe.1
>>>    [3399]     chrX [156357850, 156359017]      *   | uc009uss.2
>>>    ---
>>>    seqlengths:
>>>      chr1 chr10 chr11 chr12 chr13 chr14 ...  chr5  chr6  chr7 chr8 chr9  chrX
>>>        NA    NA    NA    NA    NA    NA ...    NA    NA    NA NA    NA    NA
>>>
>>> sessionInfo()
>>> R version 2.15.3 (2013-03-01)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>   [7] LC_PAPER=C                 LC_NAME=C
>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods base
>>>
>>> other attached packages:
>>> [1] Rsamtools_1.10.2     Biostrings_2.26.3 GenomicRanges_1.10.7
>>> [4] IRanges_1.16.6       BiocGenerics_0.4.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] bitops_1.0-5    parallel_2.15.3 stats4_2.15.3 zlibbioc_1.4.0
>>>
>>> -- ----------------------------------------
>>> Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC Centro di
>>> Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 0116706457 Fax
>>> ++39 0112366457 Mobile ++39 3333827080 email: raffaele.calogero at unito.it
>>> raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
>
> --
>
> ----------------------------------------
> Prof. Raffaele A. Calogero
> Bioinformatics and Genomics Unit
> MBC Centro di Biotecnologie Molecolari
> Via Nizza 52, Torino 10126
> tel.   ++39 0116706457
> Fax    ++39 0112366457
> Mobile ++39 3333827080
> email:raffaele.calogero at unito.it
>         raffaele[dot]calogero[at]gmail[dot]com
> www:http://www.bioinformatica.unito.it
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list