[BioC] Problem with BAM header in GAlignments

Taylor, Sean D sdtaylor at fhcrc.org
Wed Jan 8 00:19:23 CET 2014


Thanks Martin.

For what its worth, using asBam() did not work:
> asBam(bamfile, 'test.bam', indexDestination=TRUE)
Error in value[[3L]](cond) : 'asBam' SAM/BAM header missing or empty
  file: '03-192B3_lung.sort.nodup.bam.realigned.bam'
  SAM file: '03-192B3_lung.sort.nodup.bam.realigned.bam'
In addition: Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
  [samopen] no @SQ lines in the header.

Here is what the file looks like beyond the header:
@PG     ID:GATK IndelRealigner  VN:2.4-9-g532efad       CL:knownAlleles=[(RodBinding name=knownAlleles source=/net/shendure/vol7/akash/references/dbsnp_135.b37.vcf)] targetIntervals=/net/shendure/vol7/akash/ilsa/03-192B3_lung/03-192B3_lung.sort.nodup.bam.final.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_SW entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=500 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
V^@^@^@^B^@^@^@1^@=C<DB>^N^B^@^@^@2^@<8D><ED>~^N^B^@^@^@3^@^^<95><CD>^K^B^@^@^@4^@d<C8>d^K^B^@^@^@5^@<
<8C><C8>
^B^@^@^@6^@;^B3
^B^@^@^@7^@gC|  ^B^@^@^@8^@vV<B9>^H^B^@^@^@9^@<F7><BE>^C^@^@^@10^@<9B>^X^T^H^C^@^@^@11^@4       ^L^H^C
^@^@^@12^@<F7>j<FA>^G^C^@^@^@13^@VZ<DD>^F^C^@^@^@14^@$^Ff^F^C^@^@^@15^@@<81>^\^F^C^@^@^@16^@A<B4>b^E^C
^@^@^@17^@<CA><F0><D6>^D^C^@^@^@18^@@]<A7>^D^C^@^@^@19^@<97><<86>^C^C^@^@^@20^@p<B1><C1>^C^C^@^@^@21^@gg<DE>^B^C^@^@^@22^@v<D8>^N^C^B^@^@^@X^@<A0>=A        ^B^@^@^@Y^@<FE><F7><89>^C^C^@^@^@MT^@<B9>@^@^@^K^@^@^@GL000207.1^@<A6>^P^@^@^K^@^@^@GL000226.1^@<A0>:^@^@^K^@^@^@GL000229.1^@<C9>M^@^@^K^@^@^@GL000231.1
^@<FA>j^@^@^K^@^@^@GL000210.1^@"l^@^@^K^@^@^@GL000239.1^@ <84>^@^@^K^@^@^@GL000235.1^@<AA><86>^@^@^K^@
^@^@GL000201.1^@4<8D>^@^@^K^@^@^@GL000247.1^@F<8E>^@^@^K^@^@^@GL000245.1^@+<8F>^@^@^K^@^@^@GL000197.1^@7<91>^@^@^K^@^@^@GL000203.1^@z<92>^@^@^K^@^@^@GL000246.1^@

I have contacted my colleague to see if I can get a better sense of what this is supposed to be.


> -----Original Message-----
> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
> Sent: Tuesday, January 07, 2014 3:09 PM
> To: Taylor, Sean D; bioconductor at r-project.org
> Subject: Re: [BioC] Problem with BAM header in GAlignments
> 
> On 01/07/2014 02:52 PM, Taylor, Sean D wrote:
> > The thought did occur to me that these might be SAM format, but its odd
> because the information beyond the header appears to be binary. The files
> also came with accompanying .bai index files as well. I have never had cause
> to carefully examine the header, so I didn't know what to expect. I take it
> thought that the header in a BAM file is also supposed to in binary?
> >
> 
> Yes, the entire bam file, header and all, is supposed to be in a compressed
> binary representation.
> 
> Martin
> 
> > Sean
> >> -----Original Message-----
> >> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
> >> Sent: Tuesday, January 07, 2014 2:45 PM
> >> To: Taylor, Sean D; bioconductor at r-project.org
> >> Subject: Re: [BioC] Problem with BAM header in GAlignments
> >>
> >> On 01/07/2014 02:31 PM, Taylor, Sean D wrote:
> >>> I have received some BAM files from a colleague that were produced
> >>> from
> >> an Illumina HiSeq instrument. When I try to read them as a
> >> GAlignments object, I get the following error:
> >>>> bamfile<-'03-192B3_lung.sort.nodup.bam.realigned.bam'
> >>>> gal<-readGAlignments(bamfile)
> >>> Error in value[[3L]](cond) :
> >>>     failed to open BamFile: SAM/BAM header missing or empty
> >>>     file: '03-192B3_lung.sort.nodup.bam.realigned.bam'
> >>> In addition: Warning message:
> >>> In doTryCatch(return(expr), name, parentenv, handler) :
> >>>     [bam_header_read] invalid BAM binary header (this is not a BAM file).
> >>>
> >>>
> >>> I looked at the header data this way:
> >>>> strsplit(readLines(bamfile, 10), '\t')
> >>
> >> this (using readLines on bamfile) implies that it is actually a sam
> >> (text) file rather than a bam (binary) file, despite the .bam
> >> extension! Convert it to bam and optionally generate an index with
> >>
> >>     asBam(bamfile, "some_file_name.bam", indexDestination=TRUE)
> >>
> >> or, better, get your upstream provider to generate bam files for you.
> >>
> >> The funky information in the first line looks like the file has been
> >> corrupted in some way, too; you'll need to get that fixed first,
> >> which probably means coordinating with whomever is providing you with
> >> files. Probably you want to make sure that the 'checksums' on the
> >> file are the same before and after any copy, e.g., running the linux
> >> command md5sum *bam or from within R using tools::md5sum().
> >>
> >> Martin
> >>
> >>> [[1]]
> >>> [1] NA
> >>>
> >>> [[2]]
> >>> [1] "@SQ"          "SN:1"         "LN:249250621"
> >>>
> >>> [[3]]
> >>> [1] "@SQ"          "SN:2"         "LN:243199373"
> >>>
> >>> [[4]]030
> >>> [1] "@SQ"          "SN:3"         "LN:198022430"
> >>>
> >>> [[5]]
> >>> [1] "@SQ"          "SN:4"         "LN:191154276"
> >>>
> >>> [[6]]
> >>> [1] "@SQ"          "SN:5"         "LN:180915260"
> >>>
> >>> [[7]]
> >>> [1] "@SQ"          "SN:6"         "LN:171115067"
> >>>
> >>> [[8]]
> >>> [1] "@SQ"          "SN:7"         "LN:159138663"
> >>>
> >>> [[9]]
> >>> [1] "@SQ"          "SN:8"         "LN:146364022"
> >>>
> >>> [[10]]
> >>> [1] "@SQ"          "SN:9"         "LN:141213431"
> >>> Warning message:
> >>> In strsplit(readLines(bamfile, 10), "\t") :
> >>>     input string 1 is invalid in this locale
> >>>
> >>> And then from the command  line:
> >>> $ less 03-192B3_lung.sort.nodup.bam.realigned.bam
> >>> BAM^A<CA>^K^@^@@HD      VN:1.4  GO:none SO:coordinate
> >>> @SQ     SN:1    LN:249250621
> >>> @SQ     SN:2    LN:243199373
> >>> @SQ     SN:3    LN:198022430
> >>> @SQ     SN:4    LN:191154276
> >>> @SQ     SN:5    LN:180915260
> >>> @SQ     SN:6    LN:171115067
> >>> @SQ     SN:7    LN:159138663
> >>> @SQ     SN:8    LN:146364022
> >>> @SQ     SN:9    LN:141213431
> >>> @SQ     SN:10   LN:135534747
> >>>
> >>> I'm not sure what that first line is, but I think that is the
> >>> problem. The bam
> >> file itself is 18 GB, so it is difficult to edit directly. Is there a
> >> way I can direct readGAlignments to skip the first line? Or is there
> >> a different solution, eg filterBam? Ultimately I am going to be
> >> sifting through many such BAM files so I'm hoping for a general
> >> solution that doesn't involve manual editing if possible.
> >>>
> >>> Thanks,
> >>> Sean
> >>>
> >>>> sessionInfo()
> >>> R version 3.0.2 (2013-09-25)
> >>> Platform: x86_64-unknown-linux-gnu (64-bit)
> >>>
> >>> locale:
> >>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >> LC_TIME=en_US.UTF-8
> >>>    [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
> >> LC_MESSAGES=en_US.UTF-8
> >>>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
> >>> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8
> >> LC_IDENTIFICATION=C
> >>>
> >>> attached base packages:
> >>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
> >>>
> >>> other attached packages:
> >>> [1] latticeExtra_0.6-26  RColorBrewer_1.0-5   ShortRead_1.20.0
> >> lattice_0.20-24
> >>>    [5] Rsamtools_1.14.1     Biostrings_2.30.0    GenomicRanges_1.14.3
> >> XVector_0.2.0
> >>>    [9] IRanges_1.20.5       BiocGenerics_0.8.0
> >>>
> >>> loaded via a namespace (and not attached):
> >>> [1] Biobase_2.21.7 bitops_1.0-6   grid_3.0.2     hwriter_1.3    stats4_3.0.2
> >> tools_3.0.2
> >>> [7] zlibbioc_1.7.0
> >>>
> >>>
> >>> Sean Taylor
> >>> Post-doctoral Fellow
> >>> Fred Hutchinson Cancer Research Center
> >>> 206-667-5544
> >>>
> >>>
> >>> 	[[alternative HTML version deleted]]
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>
> >>
> >>
> >> --
> >> Computational Biology / Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N.
> >> PO Box 19024 Seattle, WA 98109
> >>
> >> Location: Arnold Building M1 B861
> >> Phone: (206) 667-2793
> 
> 
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793



More information about the Bioconductor mailing list