[Bioc-sig-seq] Problem with ShortRead reading quality scores from bowtie

Wed Aug 5 18:21:56 CEST 2009

Hi Faud --

Fuad Gwadry wrote:
> Hi Martin.
>  
> My data is generated from the old pipeline (illumina 0.3) and 
> therefore should not be a problem. In any case I tested your suggestion 
> but it had no effect on the data. The colMeans(m) are the same before 
> and after incorporating your suggestions. Also when I use alignQuality I 
> get NAs in both cases. Let me know how I can help to trouble shoot this. 

NA's in alignQuality are present when the reads did not align; not all 
of the values will be NA. see below for the issue with base quality scores.

> -- Fuad
>  
> The first part is using my original data:
>  
>  > aln
> class: AlignedRead
> length: 4933275 reads; width: 32 cycles
> chromosome: phiX174 phiX174 ... phiX174 phiX174
> position: 5147 1513 ... 5034 3107
> strand: - - ... + +
> alignQuality: NumericQuality
> alignData varLabels: similar mismatch
>  > head(quality(aln))
> class: SFastqQuality
> quality:
>   A BStringSet instance of length 6
>     width seq
> [1]    32 <<<<<<<<<<<<<<<<<;<<<;<<<<<5.<,:
> [2]    32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> [3]    32 <<<<<<<<<<<<;<<;<<<<<<<+<<<<<<<<
> [4]    32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> [5]    32 <<<<<<<<<<<<<<<<<<<;<<<<<<<<<<6<
> [6]    32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>  > m <- as(quality(aln), "matrix")
>  > colMeans(m)
>  [1] -3.865733 -3.867642 -3.878435 -3.869599 -3.884619 -3.891333 -3.904493
>  [8] -3.933454 -3.965951 -3.995224 -4.301351 -4.331125 -4.427614 -4.495761
> [15] -4.556334 -4.586264 -4.623881 -4.653223 -4.744307 -4.825347 -4.913756
> [22] -5.000141 -5.138716 -5.213431 -5.375823 -5.635962 -5.779450 -5.791400
> [29] -5.944196 -6.175808 -6.368913 -6.597291
>  > alignQuality(aln)
> class: NumericQuality
> quality: NA NA ... NA NA (4933275 total)
>  
> ##################
> incorporating your suggestions
> ##################
> 
>  > qual <- FastqQuality(quality(quality(aln)))
>  > qual
> class: FastqQuality
> quality:
>   A BStringSet instance of length 4933275
>           width seq
>       [1]    32 <<<<<<<<<<<<<<<<<;<<<;<<<<<5.<,:
>       [2]    32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>       [3]    32 <<<<<<<<<<<<;<<;<<<<<<<+<<<<<<<<
>       [4]    32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>       [5]    32 <<<<<<<<<<<<<<<<<<<;<<<<<<<<<<6<
>       [6]    32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>       [7]    32 <<<<<<<<<<<<<<<<+<-<<<<;<<<<<<<<
>       [8]    32 <<<<<<<<<<<<<<<<<<<<<<<<<,<+<<<<
>       [9]    32 <<<<<<<<<<<<<<<<<<<5<<<<7<<<<<<2
>       ...   ... ...
> [4933267]    32 <<<<<<<<<<<<<<<<<<<<<<3<<8<<<<<<
> [4933268]    32 <<<<<<<<<<<<<<<<<<<<<;<<<98<<<<<
> [4933269]    32 <<<<<<<<<<<<<<<<</<<<5<<1;<;<7<<
> [4933270]    32 <<<<<<<<<<<<<<<<<<<<<<<<;<<<<<6:
> [4933271]    32 <<<<<<<<73;<;<<<<<;';57<;<0;<<55
> [4933272]    32 <<<<<<3<<<<<<<<6<<<<<<-7<<<7*<<7
> [4933273]    32 <<<<<<<<<<<<<3<<<<<<+<<6<<<4&5<6
> [4933274]    32 <<<<<<<<<<<<<<<<<<<<<<<<4<<<<<<<
> [4933275]    32 <<<<<<<<<<<<<<<<<7<<;<8<<6<<<<<)
>  >   initialize(aln, quality=qual)
> class: AlignedRead
> length: 4933275 reads; width: 32 cycles
> chromosome: phiX174 phiX174 ... phiX174 phiX174
> position: 5147 1513 ... 5034 3107
> strand: - - ... + +
> alignQuality: NumericQuality
> alignData varLabels: similar mismatch

The statement above should be

    aln <- initialize(aln, quality=qual)

i.e., create a new instance and save it to the variable aln. Save it to 
a different variable if you want to avoid over-writing aln.

Martin

>  > m <- as(quality(aln), "matrix")
>  > colMeans(m)
>  [1] -3.865733 -3.867642 -3.878435 -3.869599 -3.884619 -3.891333 -3.904493
>  [8] -3.933454 -3.965951 -3.995224 -4.301351 -4.331125 -4.427614 -4.495761
> [15] -4.556334 -4.586264 -4.623881 -4.653223 -4.744307 -4.825347 -4.913756
> [22] -5.000141 -5.138716 -5.213431 -5.375823 -5.635962 -5.779450 -5.791400
> [29] -5.944196 -6.175808 -6.368913 -6.597291
>  > alignQuality(aln)
> class: NumericQuality
> quality: NA NA ... NA NA (4933275 total)
>  
>  > sessionInfo()
> R version 2.9.1 (2009-06-26)
> x86_64-unknown-linux-gnu
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base    
> other attached packages:
> [1] ShortRead_1.3.22   lattice_0.17-25    BSgenome_1.13.10   
> Biostrings_2.13.29
> [5] IRanges_1.3.44   
> loaded via a namespace (and not attached):
> [1] Biobase_2.4.1 grid_2.9.1    hwriter_1.1  
> 
>  
>  
>  > Date: Tue, 4 Aug 2009 09:07:51 -0700
>  > From: mtmorgan at fhcrc.org
>  > To: fuad_gwadry at hotmail.com
>  > CC: bioc-sig-sequencing at r-project.org; APratap at som.umaryland.edu; 
> johannes.waage at gmail.com
>  > Subject: Re: [Bioc-sig-seq] Problem with ShortRead reading quality 
> scores from bowtie
>  >
>  > Martin Morgan wrote:
>  > > Fuad Gwadry wrote:
>  > >> Hi All
>  > >>
>  > >>
>  > >>
>  > >> I am getting negative values when reading quality scores when I read
>  > >> data generated in bowtie. Has anyone run into the same issue when
>  > >> using data generated by bowtie ? My session info is below.
>  > >
>  > > Hi Fuad -- ShortRead is reading the quality scores on the wrong scale
>  > > (solexa, rather than phred; this will be fixed before the next 
> release).
>  > > Try
>  > >
>  > > qual <- FastqQuality(quality(quality(aln))
>  > > initialize(aln, quality=qual)
>  > >
>  > > to update aln, or
>  > >
>  > > m <- as(FastqQuality(quality(quality(aln)), "matrix")
>  > >
>  > > for a one-off solution.
>  >
>  > I wanted to clarify, too, both for this post and one yesterday, that
>  > as() is simply converting the character encoding to the corresponding
>  > integer value that each letter encodes; there is a secondary mapping
>  > from this encoding to log-odds or phred score that is not being
>  > performed. This step is, I think
>  >
>  > 10^(-m/10) for phred scores
>  > 1 - 1 / (1 + 10^(-m/10)) for Solexa scores
>  >
>  > Solexa has changed its encoding scheme very recently; I think it is now
>  > standard phred but am not sure.
>  >
>  > Martin
>  >
>  > >
>  > > Martin
>  > >
>  > >>
>  > >>
>  > >>
>  > >> Thanks in advance
>  > >>
>  > >>
>  > >>
>  > >> Fuad
>  > >>
>  > >>
>  > >>
>  > >>> aln
>  > >> class: AlignedRead
>  > >> length: 4591807 reads; width: 32 cycles
>  > >> chromosome: chr13 chr7 ... chr6 chr4 position: 93437004 13223395 ...
>  > >> 23636747 23353864 strand: - - ... + + alignQuality: NumericQuality
>  > >> alignData varLabels: similar mismatch
>  > >>
>  > >>> m <- as(quality(aln), "matrix")
>  > >>> colMeans(m)
>  > >> [1] -7.186638 -7.205858 -7.203382 -7.197175 -7.203629 -7.217016
>  > >> [7] -7.240661 -7.249238 -7.268499 -7.286551 -7.306615 -7.324003
>  > >> [13] -7.523238 -7.581242 -7.697591 -7.695861 -7.735321 -7.743323
>  > >> [19] -7.752996 -7.849403 -7.862658 -7.931969 -7.979778 -8.029288
>  > >> [25] -8.120469 -8.215818 -8.335176 -8.411609 -8.587005 -8.820979
>  > >> [31] -11.447326 -11.644198
>  > >>
>  > >>
>  > >>> sessionInfo()
>  > >> R version 2.9.1 (2009-06-26) x86_64-unknown-linux-gnu
>  > >> locale:
>  > >> 
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C 
> 
>  > >>
>  > >>
>  > >> attached base packages:
>  > >> [1] stats graphics grDevices utils datasets methods base
>  > >> other attached packages:
>  > >> [1] ShortRead_1.3.22 lattice_0.17-25 BSgenome_1.13.10
>  > >> Biostrings_2.13.29
>  > >> [5] IRanges_1.3.44
>  > >> loaded via a namespace (and not attached):
>  > >> [1] Biobase_2.4.1 grid_2.9.1 hwriter_1.1
>  > >>
>  > >> _________________________________________________________________
>  > >> More storage. Better anti-spam and antivirus protection. Hotmail 
> makes
>  > >> it simple.
>  > >>
>  > >> [[alternative HTML version deleted]]
>  > >>
>  > >> _______________________________________________
>  > >> Bioc-sig-sequencing mailing list
>  > >> Bioc-sig-sequencing at r-project.org
>  > >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>  > >
>  > >
>  >
>  >
>  > --
>  > Martin Morgan
>  > Computational Biology / Fred Hutchinson Cancer Research Center
>  > 1100 Fairview Ave. N.
>  > PO Box 19024 Seattle, WA 98109
>  >
>  > Location: Arnold Building M1 B861
>  > Phone: (206) 667-2793
> 
> ------------------------------------------------------------------------
> Attention all humans. We are your photos. Free us. 
> <http://go.microsoft.com/?linkid=9666046>

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793