[Bioc-devel] PhredQuality from Biostrings

Martin Morgan mtmorgan at fhcrc.org
Fri Jun 10 19:54:26 CEST 2011


On 06/10/2011 08:01 AM, Christian Ruckert wrote:
> Hi,
>
> I have written a function to read-in Roche SFF(Standard Flowgram Format)
> files into R. Now I want to store the contents in standard Bioconductor
> structures (e.q. sequences as DNAStringSet object). I have the quality
> scores as a list of integer vectors. One list entry for each sequence.
> The vector lengths correspond to the sequence lengths. The vectors
> contain entries between 0 and 40 corresponding to the base quality at
> this position.
>
> Here is an example for one list entry, a sequence of length 82:
>
>> qualitylist[[1]]
> [1] 40 40 40 40 40 40 40 40 40 40 40 40 36 24 16 16 16 27 27 36 20 20 27
> 27 31
> [26] 27 36 38 39 40 40 40 40 40 40 40 40 40 40 40 40 40 39 34 34 38 39
> 40 40 40
> [51] 40 40 40 40 40 40 40 40 40 40 40 40 30 20 20 20 36 40 40 40 40 30
> 30 30 30
> [76] 39 40 40 40 40 40 40
>
> Now I'm looking for an elegant way to convert my list of integer vectors
> to an PhredQuality object, but the solution I found is very slow for a
> list with 90000 sequences and a mean sequence length of around 400.
>
>> pq = PhredQuality(sapply(qualitylist, function(x)
> toString(PhredQuality(x))))

Hi Christian

Maybe along the lines of

PhredQuality(sapply(qualitylist, function(x) rawToChar(as.raw(x + 33))))

or via ShortRead::readQual / readFastaQual (can use a character vector 
for the path; no need to create a RochePath). Probably you'll find 
ShortReadQ useful for coordinating the sequences and qualities

Martin


>
> Is there a faster way creating a PhredQuality object out of a list like
> mine.
>
> Regards,
> Christian
>
>
>
>> sessionInfo()
> R version 2.14.0 Under development (unstable) (2011-05-17 r55946)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] R453Plus1Toolbox_1.3.1
> [2] BSgenome.Scerevisiae.UCSC.sacCer2_1.3.17
> [3] BSgenome_1.21.0
> [4] GenomicRanges_1.5.7
> [5] Biostrings_2.21.3
> [6] IRanges_1.11.5
> [7] Biobase_2.13.2
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.9.1 hwriter_1.3 R2HTML_2.2 RCurl_1.6-1
> [5] Rsamtools_1.5.17 ShortRead_1.11.6 tools_2.14.0 XML_3.4-0
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



More information about the Bioc-devel mailing list