[Bioc-devel] PhredQuality from Biostrings

Christian Ruckert cruckert at uni-muenster.de
Fri Jun 10 17:01:59 CEST 2011


Hi,

I have written a function to read Roche SFF (Standard Flowgram Format)
files into R. Now I want to store the contents in standard Bioconductor
structures (e.g. sequences as a DNAStringSet object). I have the quality
scores as a list of integer vectors, one list entry per sequence.
The vector lengths correspond to the sequence lengths, and each vector
contains values between 0 and 40 giving the base quality at
each position.

Here is an example for one list entry, a sequence of length 82:

> qualitylist[[1]]
 [1] 40 40 40 40 40 40 40 40 40 40 40 40 36 24 16 16 16 27 27 36 20 20 27 27 31
[26] 27 36 38 39 40 40 40 40 40 40 40 40 40 40 40 40 40 39 34 34 38 39 40 40 40
[51] 40 40 40 40 40 40 40 40 40 40 40 40 30 20 20 20 36 40 40 40 40 30 30 30 30
[76] 39 40 40 40 40 40 40

Now I'm looking for an elegant way to convert my list of integer vectors
to a PhredQuality object, but the solution I found is very slow for a
list of 90000 sequences with a mean sequence length of around 400:

> pq = PhredQuality(sapply(qualitylist, function(x) toString(PhredQuality(x))))

Is there a faster way to create a PhredQuality object from a list like
mine?
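
One direction that might avoid building an intermediate PhredQuality
object for every sequence is to encode each integer vector directly as a
Phred+33 ASCII string in base R and then call PhredQuality() once on the
resulting character vector. This is only a minimal sketch: it assumes the
usual offset of 33 for Phred qualities, phred_to_string is a hypothetical
helper name (not a Biostrings function), and the small qualitylist below
is a toy stand-in for the real data. It has not been timed on the full
90000 sequences.

```r
# Hypothetical helper (not part of Biostrings): encode one vector of Phred
# scores as a single ASCII string, assuming an offset of 33.
phred_to_string <- function(scores, offset = 33L) {
  # Keep the encoded values inside the printable ASCII range (33 to 126).
  stopifnot(all(scores >= 0L), all(scores + offset <= 126L))
  rawToChar(as.raw(scores + offset))
}

# Toy stand-in for the real list of per-sequence quality vectors.
qualitylist <- list(c(40L, 36L, 24L), c(0L, 40L))

# One string per sequence; vapply guarantees a character vector result.
qual_strings <- vapply(qualitylist, phred_to_string, character(1))

# A single constructor call on the whole character vector,
# run only if Biostrings is installed.
if (requireNamespace("Biostrings", quietly = TRUE)) {
  pq <- Biostrings::PhredQuality(qual_strings)
}
```

The string widths in qual_strings equal the lengths of the input vectors,
so they should still line up with the corresponding sequence lengths.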

Regards,
Christian



> sessionInfo()
R version 2.14.0 Under development (unstable) (2011-05-17 r55946)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=C                 LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] R453Plus1Toolbox_1.3.1
[2] BSgenome.Scerevisiae.UCSC.sacCer2_1.3.17
[3] BSgenome_1.21.0
[4] GenomicRanges_1.5.7
[5] Biostrings_2.21.3
[6] IRanges_1.11.5
[7] Biobase_2.13.2

loaded via a namespace (and not attached):
[1] biomaRt_2.9.1    hwriter_1.3      R2HTML_2.2       RCurl_1.6-1
[5] Rsamtools_1.5.17 ShortRead_1.11.6 tools_2.14.0     XML_3.4-0


