[Bioc-sig-seq] Problem with ShortRead reading quality scores from bowtie

Thu Aug 6 16:05:23 CEST 2009

Fuad Gwadry <fuad_gwadry at hotmail.com> writes:

> Hi Martin,
>  
> Thank you for the clarifications. 
>  
> Another question regarding quality scores. The Solexa/Illumina 1.0 format
> encodes a quality score from -5 to 40 using ASCII 59 to 104
> ([[http://en.wikipedia.org/wiki/FASTQ_format#Quality]]). However, if you look
> below the the ASCII characters do not reflect that. I am actually using
> Solexa/Illumina 0.3 but my understanding is that it is the same format as 1.0.
>  
> Am I thinking about this correctly ?

Hi Fuad --

Yes Solexa reports an encoding on ASCII 59:104. These are encoded
correctly by SFastqQuality

> solexa <- rawToChar(as.raw(59:104))
> solexa
[1] ";<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefgh"
> as(SFastqQuality(solexa), "matrix")
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]   -5   -4   -3   -2   -1    0    1    2    3     4
     [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19]
[1,]     5     6     7     8     9    10    11    12    13
     [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28]
[1,]    14    15    16    17    18    19    20    21    22
     [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37]
[1,]    23    24    25    26    27    28    29    30    31
     [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46]
[1,]    32    33    34    35    36    37    38    39    40

> (phred <- rawToChar(as.raw(33:126)))
[1] "!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
> as(FastqQuality(c(phred, phred)), "matrix")
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    1    2    3    4    5    6    7    8     9
     [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19]
[1,]    10    11    12    13    14    15    16    17    18
     [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28]
[1,]    19    20    21    22    23    24    25    26    27
     [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37]
[1,]    28    29    30    31    32    33    34    35    36
     [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46]
[1,]    37    38    39    40    41    42    43    44    45
     [,47] [,48] [,49] [,50] [,51] [,52] [,53] [,54] [,55]
[1,]    46    47    48    49    50    51    52    53    54
     [,56] [,57] [,58] [,59] [,60] [,61] [,62] [,63] [,64]
[1,]    55    56    57    58    59    60    61    62    63
     [,65] [,66] [,67] [,68] [,69] [,70] [,71] [,72] [,73]
[1,]    64    65    66    67    68    69    70    71    72
     [,74] [,75] [,76] [,77] [,78] [,79] [,80] [,81] [,82]
[1,]    73    74    75    76    77    78    79    80    81
     [,83] [,84] [,85] [,86] [,87] [,88] [,89] [,90] [,91]
[1,]    82    83    84    85    86    87    88    89    90
     [,92] [,93] [,94]
[1,]    91    92    93

And yes, the SFastqQuality class will incorporate printable ASCII
characters outside 59:104 if the user provides them.

The quality scores look a lot like Phred encoding. You mentioned
Bowtie. The current Bowtie manual says it outputs Phred
scores. ?readAligned for type="Bowtie" says that you can specify this
input with the argument qualityType="FastqQuality".

Martin

> Fuad
>> alf
>  [1] " "  "!"  "\"" "#"  "$"  "%"  "&"  "'"  "("  ")"  "*"  "+"  ","  "-"  "."
> [16] "/"  "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  ":"  ";"  "<"  "="
> [31] ">"  "?"  "@"  "A"  "B"  "C"  "D"  "E"  "F"  "G"  "H"  "I"  "J"  "K"  "L"
> [46] "M"  "N"  "O"  "P"  "Q"  "R"  "S"  "T"  "U"  "V"  "W"  "X"  "Y"  "Z"  "["
> [61] "\\" "]"  "^"  "_"  "`"  "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "i"  "j"
> [76] "k"  "l"  "m"  "n"  "o"  "p"  "q"  "r"  "s"  "t"  "u"  "v"  "w"  "x"  "y"
> [91] "z"  "{"  "|"  "}"
>> quality(aln)
> class: SFastqQuality
> quality:
>   A BStringSet instance of length 4933275
>           width seq
>       [1]    32 <<<<<<<<<<<<<<<<<;<<<;<<<<<5.<,:
>       [2]    32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>       [3]    32 <<<<<<<<<<<<;<<;<<<<<<<+<<<<<<<<
>       [4]    32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>       [5]    32 <<<<<<<<<<<<<<<<<<<;<<<<<<<<<<6<
>       [6]    32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>       [7]    32 <<<<<<<<<<<<<<<<+<-<<<<;<<<<<<<<
>       [8]    32 <<<<<<<<<<<<<<<<<<<<<<<<<,<+<<<<
>       [9]    32 <<<<<<<<<<<<<<<<<<<5<<<<7<<<<<<2
>       ...   ... ...
> [4933267]    32 <<<<<<<<<<<<<<<<<<<<<<3<<8<<<<<<
> [4933268]    32 <<<<<<<<<<<<<<<<<<<<<;<<<98<<<<<
> [4933269]    32 <<<<<<<<<<<<<<<<</<<<5<<1;<;<7<<
> [4933270]    32 <<<<<<<<<<<<<<<<<<<<<<<<;<<<<<6:
> [4933271]    32 <<<<<<<<73;<;<<<<<;';57<;<0;<<55
> [4933272]    32 <<<<<<3<<<<<<<<6<<<<<<-7<<<7*<<7
> [4933273]    32 <<<<<<<<<<<<<3<<<<<<+<<6<<<4&5<6
> [4933274]    32 <<<<<<<<<<<<<<<<<<<<<<<<4<<<<<<<
> [4933275]    32 <<<<<<<<<<<<<<<<<7<<;<8<<6<<<<<)
>  
>> Date: Wed, 5 Aug 2009 09:21:56 -0700
>> From: mtmorgan at fhcrc.org
>> To: fuad_gwadry at hotmail.com
>> CC: bioc-sig-sequencing at r-project.org; apratap at som.umaryland.edu;
> johannes.waage at gmail.com
>> Subject: Re: [Bioc-sig-seq] Problem with ShortRead reading quality scores
> from bowtie
>>
>> Hi Faud --
>>
>> Fuad Gwadry wrote:
>> > Hi Martin.
>> >
>> > My data is generated from the old pipeline (illumina 0.3) and
>> > therefore should not be a problem. In any case I tested your suggestion
>> > but it had no effect on the data. The colMeans(m) are the same before
>> > and after incorporating your suggestions. Also when I use alignQuality I
>> > get NAs in both cases. Let me know how I can help to trouble shoot this.
>>
>> NA's in alignQuality are present when the reads did not align; not all
>> of the values will be NA. see below for the issue with base quality scores.
>>
>> > -- Fuad
>> >
>> > The first part is using my original data:
>> >
>> > > aln
>> > class: AlignedRead
>> > length: 4933275 reads; width: 32 cycles
>> > chromosome: phiX174 phiX174 ... phiX174 phiX174
>> > position: 5147 1513 ... 5034 3107
>> > strand: - - ... + +
>> > alignQuality: NumericQuality
>> > alignData varLabels: similar mismatch
>> > > head(quality(aln))
>> > class: SFastqQuality
>> > quality:
>> > A BStringSet instance of length 6
>> > width seq
>> > [1] 32 <<<<<<<<<<<<<<<<<;<<<;<<<<<5.<,:
>> > [2] 32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>> > [3] 32 <<<<<<<<<<<<;<<;<<<<<<<+<<<<<<<<
>> > [4] 32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>> > [5] 32 <<<<<<<<<<<<<<<<<<<;<<<<<<<<<<6<
>> > [6] 32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>> > > m <- as(quality(aln), "matrix")
>> > > colMeans(m)
>> > [1] -3.865733 -3.867642 -3.878435 -3.869599 -3.884619 -3.891333 -3.904493
>> > [8] -3.933454 -3.965951 -3.995224 -4.301351 -4.331125 -4.427614 -4.495761
>> > [15] -4.556334 -4.586264 -4.623881 -4.653223 -4.744307 -4.825347 -4.913756
>> > [22] -5.000141 -5.138716 -5.213431 -5.375823 -5.635962 -5.779450 -5.791400
>> > [29] -5.944196 -6.175808 -6.368913 -6.597291
>> > > alignQuality(aln)
>> > class: NumericQuality
>> > quality: NA NA ... NA NA (4933275 total)
>> >
>> > ##################
>> > incorporating your suggestions
>> > ##################
>> >
>> > > qual <- FastqQuality(quality(quality(aln)))
>> > > qual
>> > class: FastqQuality
>> > quality:
>> > A BStringSet instance of length 4933275
>> > width seq
>> > [1] 32 <<<<<<<<<<<<<<<<<;<<<;<<<<<5.<,:
>> > [2] 32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>> > [3] 32 <<<<<<<<<<<<;<<;<<<<<<<+<<<<<<<<
>> > [4] 32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>> > [5] 32 <<<<<<<<<<<<<<<<<<<;<<<<<<<<<<6<
>> > [6] 32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>> > [7] 32 <<<<<<<<<<<<<<<<+<-<<<<;<<<<<<<<
>> > [8] 32 <<<<<<<<<<<<<<<<<<<<<<<<<,<+<<<<
>> > [9] 32 <<<<<<<<<<<<<<<<<<<5<<<<7<<<<<<2
>> > ... ... ...
>> > [4933267] 32 <<<<<<<<<<<<<<<<<<<<<<3<<8<<<<<<
>> > [4933268] 32 <<<<<<<<<<<<<<<<<<<<<;<<<98<<<<<
>> > [4933269] 32 <<<<<<<<<<<<<<<<</<<<5<<1;<;<7<<
>> > [4933270] 32 <<<<<<<<<<<<<<<<<<<<<<<<;<<<<<6:
>> > [4933271] 32 <<<<<<<<73;<;<<<<<;';57<;<0;<<55
>> > [4933272] 32 <<<<<<3<<<<<<<<6<<<<<<-7<<<7*<<7
>> > [4933273] 32 <<<<<<<<<<<<<3<<<<<<+<<6<<<4&5<6
>> > [4933274] 32 <<<<<<<<<<<<<<<<<<<<<<<<4<<<<<<<
>> > [4933275] 32 <<<<<<<<<<<<<<<<<7<<;<8<<6<<<<<)
>> > > initialize(aln, quality=qual)
>> > class: AlignedRead
>> > length: 4933275 reads; width: 32 cycles
>> > chromosome: phiX174 phiX174 ... phiX174 phiX174
>> > position: 5147 1513 ... 5034 3107
>> > strand: - - ... + +
>> > alignQuality: NumericQuality
>> > alignData varLabels: similar mismatch
>>
>> The statement above should be
>>
>> aln <- initialize(aln, quality=qual)
>>
>> i.e., create a new instance and save it to the variable aln. Save it to
>> a different variable if you want to avoid over-writing aln.
>>
>> Martin
>>
>> > > m <- as(quality(aln), "matrix")
>> > > colMeans(m)
>> > [1] -3.865733 -3.867642 -3.878435 -3.869599 -3.884619 -3.891333 -3.904493
>> > [8] -3.933454 -3.965951 -3.995224 -4.301351 -4.331125 -4.427614 -4.495761
>> > [15] -4.556334 -4.586264 -4.623881 -4.653223 -4.744307 -4.825347 -4.913756
>> > [22] -5.000141 -5.138716 -5.213431 -5.375823 -5.635962 -5.779450 -5.791400
>> > [29] -5.944196 -6.175808 -6.368913 -6.597291
>> > > alignQuality(aln)
>> > class: NumericQuality
>> > quality: NA NA ... NA NA (4933275 total)
>> >
>> > > sessionInfo()
>> > R version 2.9.1 (2009-06-26)
>> > x86_64-unknown-linux-gnu
>> > locale:
>> >
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>> > attached base packages:
>> > [1] stats graphics grDevices utils datasets methods base
>> > other attached packages:
>> > [1] ShortRead_1.3.22 lattice_0.17-25 BSgenome_1.13.10
>> > Biostrings_2.13.29
>> > [5] IRanges_1.3.44
>> > loaded via a namespace (and not attached):
>> > [1] Biobase_2.4.1 grid_2.9.1 hwriter_1.1
>> >
>> >
>> >
>> > > Date: Tue, 4 Aug 2009 09:07:51 -0700
>> > > From: mtmorgan at fhcrc.org
>> > > To: fuad_gwadry at hotmail.com
>> > > CC: bioc-sig-sequencing at r-project.org; APratap at som.umaryland.edu;
>> > johannes.waage at gmail.com
>> > > Subject: Re: [Bioc-sig-seq] Problem with ShortRead reading quality
>> > scores from bowtie
>> > >
>> > > Martin Morgan wrote:
>> > > > Fuad Gwadry wrote:
>> > > >> Hi All
>> > > >>
>> > > >>
>> > > >>
>> > > >> I am getting negative values when reading quality scores when I read
>> > > >> data generated in bowtie. Has anyone run into the same issue when
>> > > >> using data generated by bowtie ? My session info is below.
>> > > >
>> > > > Hi Fuad -- ShortRead is reading the quality scores on the wrong scale
>> > > > (solexa, rather than phred; this will be fixed before the next
>> > release).
>> > > > Try
>> > > >
>> > > > qual <- FastqQuality(quality(quality(aln))
>> > > > initialize(aln, quality=qual)
>> > > >
>> > > > to update aln, or
>> > > >
>> > > > m <- as(FastqQuality(quality(quality(aln)), "matrix")
>> > > >
>> > > > for a one-off solution.
>> > >
>> > > I wanted to clarify, too, both for this post and one yesterday, that
>> > > as() is simply converting the character encoding to the corresponding
>> > > integer value that each letter encodes; there is a secondary mapping
>> > > from this encoding to log-odds or phred score that is not being
>> > > performed. This step is, I think
>> > >
>> > > 10^(-m/10) for phred scores
>> > > 1 - 1 / (1 + 10^(-m/10)) for Solexa scores
>> > >
>> > > Solexa has changed its encoding scheme very recently; I think it is now
>> > > standard phred but am not sure.
>> > >
>> > > Martin
>> > >
>> > > >
>> > > > Martin
>> > > >
>> > > >>
>> > > >>
>> > > >>
>> > > >> Thanks in advance
>> > > >>
>> > > >>
>> > > >>
>> > > >> Fuad
>> > > >>
>> > > >>
>> > > >>
>> > > >>> aln
>> > > >> class: AlignedRead
>> > > >> length: 4591807 reads; width: 32 cycles
>> > > >> chromosome: chr13 chr7 ... chr6 chr4 position: 93437004 13223395 ...
>> > > >> 23636747 23353864 strand: - - ... + + alignQuality: NumericQuality
>> > > >> alignData varLabels: similar mismatch
>> > > >>
>> > > >>> m <- as(quality(aln), "matrix")
>> > > >>> colMeans(m)
>> > > >> [1] -7.186638 -7.205858 -7.203382 -7.197175 -7.203629 -7.217016
>> > > >> [7] -7.240661 -7.249238 -7.268499 -7.286551 -7.306615 -7.324003
>> > > >> [13] -7.523238 -7.581242 -7.697591 -7.695861 -7.735321 -7.743323
>> > > >> [19] -7.752996 -7.849403 -7.862658 -7.931969 -7.979778 -8.029288
>> > > >> [25] -8.120469 -8.215818 -8.335176 -8.411609 -8.587005 -8.820979
>> > > >> [31] -11.447326 -11.644198
>> > > >>
>> > > >>
>> > > >>> sessionInfo()
>> > > >> R version 2.9.1 (2009-06-26) x86_64-unknown-linux-gnu
>> > > >> locale:
>> > > >>
>> >
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>> >
>> > > >>
>> > > >>
>> > > >> attached base packages:
>> > > >> [1] stats graphics grDevices utils datasets methods base
>> > > >> other attached packages:
>> > > >> [1] ShortRead_1.3.22 lattice_0.17-25 BSgenome_1.13.10
>> > > >> Biostrings_2.13.29
>> > > >> [5] IRanges_1.3.44
>> > > >> loaded via a namespace (and not attached):
>> > > >> [1] Biobase_2.4.1 grid_2.9.1 hwriter_1.1
>> > > >>
>> > > >> _________________________________________________________________
>> > > >> More storage. Better anti-spam and antivirus protection. Hotmail
>> > makes
>> > > >> it simple.
>> > > >>
>> > > >> [[alternative HTML version deleted]]
>> > > >>
>> > > >> _______________________________________________
>> > > >> Bioc-sig-sequencing mailing list
>> > > >> Bioc-sig-sequencing at r-project.org
>> > > >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> > > >
>> > > >
>> > >
>> > >
>> > > --
>> > > Martin Morgan
>> > > Computational Biology / Fred Hutchinson Cancer Research Center
>> > > 1100 Fairview Ave. N.
>> > > PO Box 19024 Seattle, WA 98109
>> > >
>> > > Location: Arnold Building M1 B861
>> > > Phone: (206) 667-2793
>> >
>> > ------------------------------------------------------------------------
>> > Attention all humans. We are your photos. Free us.
>> > <http://go.microsoft.com/?linkid=9666046>
>>
>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>
> ------------------------------------------------------------------------------
>
> Send and receive email from all of your webmail accounts - right from [[your
> Hotmail inbox!]]

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793