[BioC] Problem with BAM header in GAlignments

Martin Morgan mtmorgan at fhcrc.org
Tue Jan 7 23:45:00 CET 2014


On 01/07/2014 02:31 PM, Taylor, Sean D wrote:
> I have received some BAM files from a colleague that were produced from an Illumina HiSeq instrument. When I try to read them as a GAlignments object, I get the following error:
>> bamfile<-'03-192B3_lung.sort.nodup.bam.realigned.bam'
>> gal<-readGAlignments(bamfile)
> Error in value[[3L]](cond) :
>    failed to open BamFile: SAM/BAM header missing or empty
>    file: '03-192B3_lung.sort.nodup.bam.realigned.bam'
> In addition: Warning message:
> In doTryCatch(return(expr), name, parentenv, handler) :
>    [bam_header_read] invalid BAM binary header (this is not a BAM file).
>
>
> I looked at the header data this way:
>> strsplit(readLines(bamfile, 10), '\t')

this (using readLines on bamfile) implies that it is actually a sam (text) file 
rather than a bam (binary) file, despite the .bam extension! Convert it to bam 
and optionally generate an index with

   asBam(bamfile, "some_file_name.bam", indexDestination=TRUE)

or, better, get your upstream provider to generate bam files for you.

The funky information in the first line looks like the file has been corrupted 
in some way, too; you'll need to get that fixed first, which probably means 
coordinating with whomever is providing you with files. Probably you want to 
make sure that the 'checksums' on the file are the same before and after any 
copy, e.g., running the linux command md5sum *bam or from within R using 
tools::md5sum().

Martin

> [[1]]
> [1] NA
>
> [[2]]
> [1] "@SQ"          "SN:1"         "LN:249250621"
>
> [[3]]
> [1] "@SQ"          "SN:2"         "LN:243199373"
>
> [[4]]030
> [1] "@SQ"          "SN:3"         "LN:198022430"
>
> [[5]]
> [1] "@SQ"          "SN:4"         "LN:191154276"
>
> [[6]]
> [1] "@SQ"          "SN:5"         "LN:180915260"
>
> [[7]]
> [1] "@SQ"          "SN:6"         "LN:171115067"
>
> [[8]]
> [1] "@SQ"          "SN:7"         "LN:159138663"
>
> [[9]]
> [1] "@SQ"          "SN:8"         "LN:146364022"
>
> [[10]]
> [1] "@SQ"          "SN:9"         "LN:141213431"
> Warning message:
> In strsplit(readLines(bamfile, 10), "\t") :
>    input string 1 is invalid in this locale
>
> And then from the command  line:
> $ less 03-192B3_lung.sort.nodup.bam.realigned.bam
> BAM^A<CA>^K^@^@@HD      VN:1.4  GO:none SO:coordinate
> @SQ     SN:1    LN:249250621
> @SQ     SN:2    LN:243199373
> @SQ     SN:3    LN:198022430
> @SQ     SN:4    LN:191154276
> @SQ     SN:5    LN:180915260
> @SQ     SN:6    LN:171115067
> @SQ     SN:7    LN:159138663
> @SQ     SN:8    LN:146364022
> @SQ     SN:9    LN:141213431
> @SQ     SN:10   LN:135534747
>
> I'm not sure what that first line is, but I think that is the problem. The bam file itself is 18 GB, so it is difficult to edit directly. Is there a way I can direct readGAlignments to skip the first line? Or is there a different solution, eg filterBam? Ultimately I am going to be sifting through many such BAM files so I'm hoping for a general solution that doesn't involve manual editing if possible.
>
> Thanks,
> Sean
>
>> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
>   [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] latticeExtra_0.6-26  RColorBrewer_1.0-5   ShortRead_1.20.0     lattice_0.20-24
>   [5] Rsamtools_1.14.1     Biostrings_2.30.0    GenomicRanges_1.14.3 XVector_0.2.0
>   [9] IRanges_1.20.5       BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.21.7 bitops_1.0-6   grid_3.0.2     hwriter_1.3    stats4_3.0.2   tools_3.0.2
> [7] zlibbioc_1.7.0
>
>
> Sean Taylor
> Post-doctoral Fellow
> Fred Hutchinson Cancer Research Center
> 206-667-5544
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list