[BioC] filterBam
Martin Morgan
mtmorgan at fhcrc.org
Tue Jul 16 15:04:59 CEST 2013
On 07/16/2013 05:01 AM, rcaloger wrote:
> Dear Martin,
> many thanks for the information.
>
> I got the point:
> in exons.gr:
> seqlevels(exons.gr)
> [1] "chr1" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17"
> [10] "chr18" "chr19" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8"
> [19] "chr9" "chrX"
>
> seqlevels(BamFile("accepted_hits.sorted.bam")
> [1] "chr1" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17"
> [10] "chr18" "chr19" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8"
> [19] "chr9" "chrM" "chrX" "chrY"
>
> "chrM" is missing in exons.gr
>
> There is any simple way to add to exons.gr also the chrM?
Thanks for the file; the missing 'chrM' should not matter, and the ranges look
like they are sorted to me. Maybe a workaround is to use the
indexDestination=FALSE argument to filterBam, and then sort and index the
filtered results
filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons",
indexDestination=FALSE, param=ScanBamParam(which=exons.gr))
sortBam("accepted_hits.tp_exons", "accepted_hits.tp_exons.sorted")
indexBam("accepted_hits.tp_exons.sorted")
Martin
> you can get the test.Rda at :
> https://mail.unito.it/mailftp/uplister/get_file.php?file=ZmlsZXMvRjEvNDAvZDBjMGExMTk0OTkyYWM3YzY4Nzg0MTFlYjc1My80MzA0ZWU0MzA1OGMzYTBhMmQ1NWZjMWZjNDM4M2M5ZQ==
> Many thanks !
> Raf
>
> the file I used to generate exon.gr was ordered by chr and then by start position
>
> On 7/16/13 1:25 PM, Martin Morgan wrote:
>> On 07/15/2013 11:19 PM, rcaloger wrote:
>>> Hi,
>>> I am trying to filter a bam file using a set of coordinates.
>>> I sorted the bam file using both sortBam or Picard SortSam tools but I always
>>> get the same error:
>>>
>>> filterBam("accepted_hits.sorted.bam", "accepted_hits.tp_exons",
>>> param=ScanBamParam(which=exons.gr))
>>>
>>> Error in
>>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthetic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]],
>>>
>>> :
>>> failed to build index
>>> file:
>>> /data03/calogero/Documents/singapore/transcripts/semi_synthetic_data_set/for_tm_creation/m1/accepted_hits.tp_exons
>>>
>>>
>>> In addition: Warning messages:
>>> 1: In
>>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthetic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]],
>>>
>>> :
>>> [bam_index_core] the alignment is not sorted (D44TDFP1_1:1:1102:6940:117318):
>>> 85079683 > 84952092 in 1-th chr
>>> 2: In
>>> FUN("/data03/calogero/Documents/singapore/transcripts/semi_synthetic_data_set/for_tm_creation/m1/accepted_hits.tp_exons"[[1L]],
>>>
>>> :
>>> [bam_index_build2] fail to index the BAM file.
>>
>> filterBam tries to place the GRanges into correct order by calling
>>
>> bf <- BamFile("accepted_hits.sorted.bam")
>> xx <- Rsamtools:::.filterBam_preprocess(bf, param)
>>
>> after which
>>
>> bamWhich(xx)
>>
>> is supposed to have (a) non-overlapping ranges with (b) seqnames in the same
>> order as in the original BAM file and (c) ordered by start position.
>>
>> I'm supposing that this fails for your exons.gr, and would be happy to
>> trouble-shoot if you're willing to share (off-list is fine) the result of
>>
>> save(exons.gr, seqlevels(BamFile("accepted_hits.sorted.bam")), file="test.Rda")
>>
>> Martin
>>
>>>
>>>
>>> Any suggestion how to handle this problem?
>>> cheers
>>> Raf
>>>
>>> exons.gr
>>> GRanges with 3399 ranges and 1 metadata column:
>>> seqnames ranges strand | transcriptID
>>> <Rle> <IRanges> <Rle> | <character>
>>> [1] chr1 [ 4764598, 4766882] * | uc007aff.2
>>> [2] chr1 [ 7110275, 7110696] * | uc007agb.1
>>> [3] chr1 [ 9858548, 9858769] * | uc007agw.1
>>> [4] chr1 [10029961, 10029968] * | uc007ahk.1
>>> ... ... ... ... ... ...
>>> [3397] chrX [147525139, 147525147] * | uc012hqq.1
>>> [3398] chrX [148034128, 148034929] * | uc009upe.1
>>> [3399] chrX [156357850, 156359017] * | uc009uss.2
>>> ---
>>> seqlengths:
>>> chr1 chr10 chr11 chr12 chr13 chr14 ... chr5 chr6 chr7 chr8 chr9 chrX
>>> NA NA NA NA NA NA ... NA NA NA NA NA NA
>>>
>>> sessionInfo()
>>> R version 2.15.3 (2013-03-01)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=C LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] Rsamtools_1.10.2 Biostrings_2.26.3 GenomicRanges_1.10.7
>>> [4] IRanges_1.16.6 BiocGenerics_0.4.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] bitops_1.0-5 parallel_2.15.3 stats4_2.15.3 zlibbioc_1.4.0
>>>
>>> -- ----------------------------------------
>>> Prof. Raffaele A. Calogero Bioinformatics and Genomics Unit MBC Centro di
>>> Biotecnologie Molecolari Via Nizza 52, Torino 10126 tel. ++39 0116706457 Fax
>>> ++39 0112366457 Mobile ++39 3333827080 email: raffaele.calogero at unito.it
>>> raffaele[dot]calogero[at]gmail[dot]com www: http://www.bioinformatica.unito.it
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
>
> --
>
> ----------------------------------------
> Prof. Raffaele A. Calogero
> Bioinformatics and Genomics Unit
> MBC Centro di Biotecnologie Molecolari
> Via Nizza 52, Torino 10126
> tel. ++39 0116706457
> Fax ++39 0112366457
> Mobile ++39 3333827080
> email:raffaele.calogero at unito.it
> raffaele[dot]calogero[at]gmail[dot]com
> www:http://www.bioinformatica.unito.it
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list