[Bioc-sig-seq] Confused on fastq to numeric conversion (SolexaPipeline 1.3+)

Fri Nov 27 04:26:20 CET 2009

Hello Martin and Harris,

Thank you for your great, very complete and extremely fast replies!! ^_^

Greetings,
Leonardo

PD Just CCing so the replies get archived on the mailing list for later 
newbies like me ;)

Martin Morgan wrote:
> Harris A. Jaffee wrote:
>   
>> Perhaps this is a manifestation of those silent choices:
>>
>>     
>>> as(quality(reads[1])[[1]], "numeric")
>>>       
>>  [1] 97 97 97 96 92 97 97 97 95 96 95 96 97 97 97 97 97 97 94 88 97 96
>> 91 68 90
>> [26] 96 96 97 96 96 97 96 96 96 97 96 94 96 97 97 97 83 96 92 93 96 96
>> 92 92 97
>>     
>
> It's not really obvious to me why quality(<...>)[[1]] extracts a
> BStringSet. But once it does, the BStringSet is convert to numeric as
> appropriate for a BString, which turns out to be the ASCII encoding.
>
>   
>>> as(quality(reads[1]), "numeric")
>>>       
>>  [1] 64 64 64 63 59 64 64 64 62 63 62 63 64 64 64 64 64 64 61 55 64 63
>> 58 35 57
>> [26] 63 63 64 63 63 64 63 63 63 64 63 61 63 64 64 64 50 63 59 60 63 63
>> 59 59 64
>>     
>
> quality(<...>) is a SFastqQuality or FastqQuality object, and is
> converted to numeric as appropriate for short read qualities.
>
>   
>> I am less than a novice in ShortRead, so I don't feel adequate enough to
>> post to
>> the world.  Barely adequate enough to make a private guess.
>>     
>
> You're too modest, Harris.
>
> Martin
>
>   
>> On Nov 26, 2009, at 6:13 PM, Martin Morgan wrote:
>>
>>     
>>> Hi Leonardo --
>>>
>>> Leonardo Collado Torres wrote:
>>>       
>>>> Dear list,
>>>>
>>>> I'm puzzled over something very simple. I've got some fresh Illumina
>>>> data and I want to make a plot of the median (10% quantile, etc)
>>>> qualities (Y axis) per cycle (X axis).
>>>>
>>>> On the code below, I'm simply reading a test file in fastq format,
>>>> printing the quality, checking that nothing weird went when reading the
>>>> file (could be :P), and changing the ASCII qualities to numeric values.
>>>> The symbol "D" gets changed to 35; the lowest value on the 1st sequence.
>>>>
>>>>         
>>>>> library(ShortRead)
>>>>> reads <- readFastq(dirPath = ".", pattern = "test.txt")
>>>>>           
>>>>> reads[1]
>>>>>           
>>>> class: ShortReadQ
>>>> length: 1 reads; width: 50 cycles
>>>>         
>>>>> sread(reads[1])
>>>>>           
>>>>  A DNAStringSet instance of length 1
>>>>    width seq
>>>> [1]    50 ACCGTGCCGAGCACACGGGTGACNCCGCGATAGCCGGGCGGTGTATAGGG
>>>>         
>>>>> quality(reads[1])
>>>>>           
>>>> class: FastqQuality
>>>> quality:
>>>>  A BStringSet instance of length 1
>>>>    width seq
>>>> [1]    50 aaa`\aaa_`_`aaaaaa^Xa`[DZ``a``a```a`^`aaaS`\]``\\a
>>>>         
>>>>> system("head -n 4 test.txt")
>>>>>           
>>>> @HWUSI-EAS636_0001:8:1:1:1353#0/1
>>>> ACCGTGCCGAGCACACGGGTGACNCCGCGATAGCCGGGCGGTGTATAGGG
>>>> +HWUSI-EAS636_0001:8:1:1:1353#0/1
>>>> aaa`\aaa_`_`aaaaaa^Xa`[DZ``a``a```a`^`aaaS`\]``\\a
>>>>         
>>>>> as(quality(reads[1]), "numeric") # Will use "matrix" when using all
>>>>>           
>>>> reads instead of the 1st one
>>>> [1] 64 64 64 63 59 64 64 64 62 63 62 63 64 64 64 64 64 64 61 55 64 63 58
>>>> 35 57
>>>> [26] 63 63 64 63 63 64 63 63 63 64 63 61 63 64 64 64 50 63 59 60 63 63
>>>> 59 59 64
>>>>         
>>>>> summary(as(quality(reads[1]), "numeric"))
>>>>>           
>>>>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>>   35.0    62.0    63.0    61.7    64.0    64.0
>>>>         
>>>> Now, what confuses me is that according to
>>>> http://en.wikipedia.org/wiki/FASTQ_format SolexaPipeline 1.3+ "can
>>>> encode a Phred quality score from 0 to 62 using ASCII 64 to 126
>>>> (although in raw read data Phred scores from 0 to 40 only are
>>>> expected)". Then, on the ASCII table at wiki
>>>> (http://en.wikipedia.org/wiki/ASCII) glyph (I called it symbol earlier)
>>>> "D" corresponds to 68 decimal, which should be a Phred quality score of
>>>> 6 but its 35 in R. Glyph "a" is 97 decimal, which should be 35 but is 64
>>>> in R.
>>>>         
>>> fastq files don't contain enough information to know whether the data
>>> are encoded with ASCII 64 == 1 (Solexa-style) or ASCII 32 == 1
>>> ('classic'-style). There is an argument qualityType that you can use to
>>> specify which, so
>>>
>>>   readFastq("test.txt", qualityType="SFastqQuality")
>>>
>>> for Solexa encoding. Note that to read a single file it is no longer
>>> necessary to specify both directory and pattern. And finally that
>>> as(<...>, "matrix") etc convert from the ASCII to integer encoding;
>>> there is still a transform from integer to p value, if that is desired,
>>> and that there are several possibilities for that transformation as well.
>>>
>>> Martin
>>>
>>>       
>>>> So, do I substract 29 to all the values I get in R or am I missing
>>>> something?
>>>>
>>>> Thank you and enjoy Thanksgiving if you celebrate it,
>>>> Leonardo
>>>>
>>>>
>>>>
>>>>         
>>>>> sessionInfo()
>>>>>           
>>>> R version 2.10.0 (2009-10-26)
>>>> x86_64-unknown-linux-gnu
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C                 [3]
>>>> LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915   [5]
>>>> LC_MONETARY=C                  LC_MESSAGES=en_US.iso885915  [7]
>>>> LC_PAPER=en_US.iso885915       LC_NAME=C                    [9]
>>>> LC_ADDRESS=C                   LC_TELEPHONE=C               [11]
>>>> LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>> other attached packages:
>>>> [1] ShortRead_1.4.0   lattice_0.17-26   BSgenome_1.14.0  
>>>> Biostrings_2.14.0
>>>> [5] IRanges_1.4.0
>>>> loaded via a namespace (and not attached):
>>>> [1] Biobase_2.6.0 grid_2.10.0   hwriter_1.1   tools_2.10.0
>>>>         
>>> -- 
>>> Martin Morgan
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>       
>
>
>   

-- 
Leonardo Collado Torres, Bachelor in Genomic Sciences
Teacher at LCG and member of Dr. Enrique Morett's lab
UNAM Campus Cuernavaca, Mexico

Homepage: http://www.lcg.unam.mx/~lcollado/
Phone: [52] (777) 313-28-05