[BioC] CIGAR and seq lengths differ after trimming
Martin Morgan
mtmorgan at fhcrc.org
Fri May 31 01:53:55 CEST 2013
On 05/29/2013 04:02 PM, Sam McInturf wrote:
> Bioconductor,
> I am working on an RNA sequencing experiment on Arabidopsis (Illumina,
> 100bp, single end), using tophat to map the reads, and R for the rest.
> Many (9 of 12) libraries worked with out error. Three of the libraries
> had no `accepted_hit.bam` file, while they did have a junctions.bed, and
> had an error of "Error: could not convert to BAM with samtools". in the
> standard error output.
>
> Additionally in tophat_out/log/accepted_hits_sam_to_bam.log
> [samopen] SAM header is present: 7 sequences.
> Parse error at line 19971338: CIGAR and sequence length are inconsistent
Hi Sam -- maybe you can discover what this line of the sam file looks like, and
compare the sequence there to the sequence in the fastq file, both before and
after trimming. Martin
>
> Obviously, somewhere the CIGAR or sequence lengths were altered and no
> longer 'mesh'.
> I run :
>
> library(ShortRead)
> reads <- readFastq("myFile")
> treads <- trimEnds(reads, a = "5") #trim at a quality score of 20
> nreads <- treads[width(reads) > 21] # keep only reads longer than 21 bp
> writeFastq(nreads, "myNewFile.fastq")
>
> then a 'normal' (few options changed) tophat call
> tophat -p 4 -o outdir -G my.gtf BowtieIndex myNewFile.fastq
>
> Any ideas on what went wrong, and why it didn't work some of the time?
> Should I just rerun the trimming scripts and see if they map afterwards?
>
>
> Thanks in advance!
>
>> sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] ShortRead_1.18.0 latticeExtra_0.6-24 RColorBrewer_1.0-5
> [4] Rsamtools_1.12.3 lattice_0.20-15 Biostrings_2.28.0
> [7] GenomicRanges_1.12.3 IRanges_1.18.1 BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.20.0 bitops_1.0-5 grid_3.0.0 hwriter_1.3
> stats4_3.0.0
> [6] zlibbioc_1.6.0
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list