[Bioc-devel] phred qualities

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Wed Jun 27 22:17:45 CEST 2012


On Wed, Jun 27, 2012 at 4:14 PM, Thomas Girke <thomas.girke at ucr.edu> wrote:
> I am not sure if this will help achieving what you are trying to do, but when I
> am dealing with reads that are not of constant length and I want to perform
> efficient matrix-based computations on their phred scores or base calls then I
> inject them into a matrix pre-populated with N/NA values and of dimensions
> N_rows = number_of_reads and N_columns = length_of_longest_read. Obtaining
> such a matrix is relatively efficient since one only needs to iterate
> over the number distinct length values observed in the sequence set.
>
> For instance, if fq is a ShortReadQ object generated by readFastq from ShortRead
> then one can obtain the quality (q) and base (s) matrices like this:
>
> mymin <- min(width(fq))
> mymax <- max(width(fq))
> s <- matrix("N", length(fq), mymax)
> q <- matrix(NA, length(fq), mymax) # Or use 0 instead of NAs
> for(i in mymin:mymax) {
>     index <- width(fq)==i
>     if(any(index)) {
>        s[index, 1:i] <- as.matrix(DNAStringSet(sread(fq)[index], start=1, end=i))
>        q[index, 1:i] <- as.matrix(PhredQuality(quality(fq))[index])
>     }
> }
>
> When it comes to turning q and s back to XStringSet objects then things are
> not so efficient anymore. Perhaps there is an elegant/fast option to coerce a
> matrix of Phred scores back to a BStringSet object that I am not aware of?

I think there is.

Anyway, thanks for all the responses.  I think I will write up some
more methods for the XStringQuality stuff in Biostrings - later this
week.

Kasper

>
> Thomas
>
> On Wed, Jun 27, 2012 at 06:26:25PM +0000, Martin Morgan wrote:
>> On 06/27/2012 11:22 AM, Martin Morgan wrote:
>> > On 06/27/2012 08:02 AM, Kasper Daniel Hansen wrote:
>> >> Phred qualities are usually presented as ascii-encode numbers with an
>> >> offset of either 32 or 64. Some packages returns this as a
>> >> BStringSet. I can convert a character vector "charvec" to a list of
>> >> integers using code like
>> >> sapply(charvec, function(xx) charToRaw(xx) - 33L)
>> >>
>> >> Do we have fast(er) ways of doing this, when charvec is really long
>> >> and not necessarily with the same number of chars in each string? I
>> >> am thinking of implementing the sapply() above in C (directly
>> >> vectorizing it), but surely someone has done something like that
>> >> somewhere.
>> >
>> > I think you get this with XStringSet, e.g., PhredQuality, with
>> >
>> > x = PhredQuality(c("HH", "III"))
>> > y = as.numeric(unlist(x)) - 33L
>>
>>    as.integer
>>
>> > z = relist(y, x)
>>
>> or for a simple list
>>
>>    split(y, rep(seq_along(x), elementLengths(x))
>>
>> I have a recollection that there is something built-in...
>>
>> Martin
>>
>> >
>> > Martin
>> >
>> >>
>> >> Kasper
>> >>
>> >> _______________________________________________
>> >> Bioc-devel at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >
>> >
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel



More information about the Bioc-devel mailing list