[BioC] Problem with BAM header in GAlignments
Taylor, Sean D
sdtaylor at fhcrc.org
Wed Jan 8 00:19:23 CET 2014
Thanks Martin.
For what its worth, using asBam() did not work:
> asBam(bamfile, 'test.bam', indexDestination=TRUE)
Error in value[[3L]](cond) : 'asBam' SAM/BAM header missing or empty
file: '03-192B3_lung.sort.nodup.bam.realigned.bam'
SAM file: '03-192B3_lung.sort.nodup.bam.realigned.bam'
In addition: Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
[samopen] no @SQ lines in the header.
Here is what the file looks like beyond the header:
@PG ID:GATK IndelRealigner VN:2.4-9-g532efad CL:knownAlleles=[(RodBinding name=knownAlleles source=/net/shendure/vol7/akash/references/dbsnp_135.b37.vcf)] targetIntervals=/net/shendure/vol7/akash/ilsa/03-192B3_lung/03-192B3_lung.sort.nodup.bam.final.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_SW entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=500 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
V^@^@^@^B^@^@^@1^@=C<DB>^N^B^@^@^@2^@<8D><ED>~^N^B^@^@^@3^@^^<95><CD>^K^B^@^@^@4^@d<C8>d^K^B^@^@^@5^@<
<8C><C8>
^B^@^@^@6^@;^B3
^B^@^@^@7^@gC| ^B^@^@^@8^@vV<B9>^H^B^@^@^@9^@<F7><BE>^C^@^@^@10^@<9B>^X^T^H^C^@^@^@11^@4 ^L^H^C
^@^@^@12^@<F7>j<FA>^G^C^@^@^@13^@VZ<DD>^F^C^@^@^@14^@$^Ff^F^C^@^@^@15^@@<81>^\^F^C^@^@^@16^@A<B4>b^E^C
^@^@^@17^@<CA><F0><D6>^D^C^@^@^@18^@@]<A7>^D^C^@^@^@19^@<97><<86>^C^C^@^@^@20^@p<B1><C1>^C^C^@^@^@21^@gg<DE>^B^C^@^@^@22^@v<D8>^N^C^B^@^@^@X^@<A0>=A ^B^@^@^@Y^@<FE><F7><89>^C^C^@^@^@MT^@<B9>@^@^@^K^@^@^@GL000207.1^@<A6>^P^@^@^K^@^@^@GL000226.1^@<A0>:^@^@^K^@^@^@GL000229.1^@<C9>M^@^@^K^@^@^@GL000231.1
^@<FA>j^@^@^K^@^@^@GL000210.1^@"l^@^@^K^@^@^@GL000239.1^@ <84>^@^@^K^@^@^@GL000235.1^@<AA><86>^@^@^K^@
^@^@GL000201.1^@4<8D>^@^@^K^@^@^@GL000247.1^@F<8E>^@^@^K^@^@^@GL000245.1^@+<8F>^@^@^K^@^@^@GL000197.1^@7<91>^@^@^K^@^@^@GL000203.1^@z<92>^@^@^K^@^@^@GL000246.1^@
I have contacted my colleague to see if I can get a better sense of what this is supposed to be.
> -----Original Message-----
> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
> Sent: Tuesday, January 07, 2014 3:09 PM
> To: Taylor, Sean D; bioconductor at r-project.org
> Subject: Re: [BioC] Problem with BAM header in GAlignments
>
> On 01/07/2014 02:52 PM, Taylor, Sean D wrote:
> > The thought did occur to me that these might be SAM format, but its odd
> because the information beyond the header appears to be binary. The files
> also came with accompanying .bai index files as well. I have never had cause
> to carefully examine the header, so I didn't know what to expect. I take it
> thought that the header in a BAM file is also supposed to in binary?
> >
>
> Yes, the entire bam file, header and all, is supposed to be in a compressed
> binary representation.
>
> Martin
>
> > Sean
> >> -----Original Message-----
> >> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
> >> Sent: Tuesday, January 07, 2014 2:45 PM
> >> To: Taylor, Sean D; bioconductor at r-project.org
> >> Subject: Re: [BioC] Problem with BAM header in GAlignments
> >>
> >> On 01/07/2014 02:31 PM, Taylor, Sean D wrote:
> >>> I have received some BAM files from a colleague that were produced
> >>> from
> >> an Illumina HiSeq instrument. When I try to read them as a
> >> GAlignments object, I get the following error:
> >>>> bamfile<-'03-192B3_lung.sort.nodup.bam.realigned.bam'
> >>>> gal<-readGAlignments(bamfile)
> >>> Error in value[[3L]](cond) :
> >>> failed to open BamFile: SAM/BAM header missing or empty
> >>> file: '03-192B3_lung.sort.nodup.bam.realigned.bam'
> >>> In addition: Warning message:
> >>> In doTryCatch(return(expr), name, parentenv, handler) :
> >>> [bam_header_read] invalid BAM binary header (this is not a BAM file).
> >>>
> >>>
> >>> I looked at the header data this way:
> >>>> strsplit(readLines(bamfile, 10), '\t')
> >>
> >> this (using readLines on bamfile) implies that it is actually a sam
> >> (text) file rather than a bam (binary) file, despite the .bam
> >> extension! Convert it to bam and optionally generate an index with
> >>
> >> asBam(bamfile, "some_file_name.bam", indexDestination=TRUE)
> >>
> >> or, better, get your upstream provider to generate bam files for you.
> >>
> >> The funky information in the first line looks like the file has been
> >> corrupted in some way, too; you'll need to get that fixed first,
> >> which probably means coordinating with whomever is providing you with
> >> files. Probably you want to make sure that the 'checksums' on the
> >> file are the same before and after any copy, e.g., running the linux
> >> command md5sum *bam or from within R using tools::md5sum().
> >>
> >> Martin
> >>
> >>> [[1]]
> >>> [1] NA
> >>>
> >>> [[2]]
> >>> [1] "@SQ" "SN:1" "LN:249250621"
> >>>
> >>> [[3]]
> >>> [1] "@SQ" "SN:2" "LN:243199373"
> >>>
> >>> [[4]]030
> >>> [1] "@SQ" "SN:3" "LN:198022430"
> >>>
> >>> [[5]]
> >>> [1] "@SQ" "SN:4" "LN:191154276"
> >>>
> >>> [[6]]
> >>> [1] "@SQ" "SN:5" "LN:180915260"
> >>>
> >>> [[7]]
> >>> [1] "@SQ" "SN:6" "LN:171115067"
> >>>
> >>> [[8]]
> >>> [1] "@SQ" "SN:7" "LN:159138663"
> >>>
> >>> [[9]]
> >>> [1] "@SQ" "SN:8" "LN:146364022"
> >>>
> >>> [[10]]
> >>> [1] "@SQ" "SN:9" "LN:141213431"
> >>> Warning message:
> >>> In strsplit(readLines(bamfile, 10), "\t") :
> >>> input string 1 is invalid in this locale
> >>>
> >>> And then from the command line:
> >>> $ less 03-192B3_lung.sort.nodup.bam.realigned.bam
> >>> BAM^A<CA>^K^@^@@HD VN:1.4 GO:none SO:coordinate
> >>> @SQ SN:1 LN:249250621
> >>> @SQ SN:2 LN:243199373
> >>> @SQ SN:3 LN:198022430
> >>> @SQ SN:4 LN:191154276
> >>> @SQ SN:5 LN:180915260
> >>> @SQ SN:6 LN:171115067
> >>> @SQ SN:7 LN:159138663
> >>> @SQ SN:8 LN:146364022
> >>> @SQ SN:9 LN:141213431
> >>> @SQ SN:10 LN:135534747
> >>>
> >>> I'm not sure what that first line is, but I think that is the
> >>> problem. The bam
> >> file itself is 18 GB, so it is difficult to edit directly. Is there a
> >> way I can direct readGAlignments to skip the first line? Or is there
> >> a different solution, eg filterBam? Ultimately I am going to be
> >> sifting through many such BAM files so I'm hoping for a general
> >> solution that doesn't involve manual editing if possible.
> >>>
> >>> Thanks,
> >>> Sean
> >>>
> >>>> sessionInfo()
> >>> R version 3.0.2 (2013-09-25)
> >>> Platform: x86_64-unknown-linux-gnu (64-bit)
> >>>
> >>> locale:
> >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> >> LC_TIME=en_US.UTF-8
> >>> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
> >> LC_MESSAGES=en_US.UTF-8
> >>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
> >>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8
> >> LC_IDENTIFICATION=C
> >>>
> >>> attached base packages:
> >>> [1] parallel stats graphics grDevices utils datasets methods base
> >>>
> >>> other attached packages:
> >>> [1] latticeExtra_0.6-26 RColorBrewer_1.0-5 ShortRead_1.20.0
> >> lattice_0.20-24
> >>> [5] Rsamtools_1.14.1 Biostrings_2.30.0 GenomicRanges_1.14.3
> >> XVector_0.2.0
> >>> [9] IRanges_1.20.5 BiocGenerics_0.8.0
> >>>
> >>> loaded via a namespace (and not attached):
> >>> [1] Biobase_2.21.7 bitops_1.0-6 grid_3.0.2 hwriter_1.3 stats4_3.0.2
> >> tools_3.0.2
> >>> [7] zlibbioc_1.7.0
> >>>
> >>>
> >>> Sean Taylor
> >>> Post-doctoral Fellow
> >>> Fred Hutchinson Cancer Research Center
> >>> 206-667-5544
> >>>
> >>>
> >>> [[alternative HTML version deleted]]
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>
> >>
> >>
> >> --
> >> Computational Biology / Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N.
> >> PO Box 19024 Seattle, WA 98109
> >>
> >> Location: Arnold Building M1 B861
> >> Phone: (206) 667-2793
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
More information about the Bioconductor
mailing list