[Bioc-devel] phred qualities

Thomas Girke thomas.girke at ucr.edu
Wed Jun 27 22:14:17 CEST 2012


I am not sure if this will help achieving what you are trying to do, but when I
am dealing with reads that are not of constant length and I want to perform                            
efficient matrix-based computations on their phred scores or base calls then I
inject them into a matrix pre-populated with N/NA values and of dimensions                                                                                                                                                                    
N_rows = number_of_reads and N_columns = length_of_longest_read. Obtaining                                                                                                                                                                    
such a matrix is relatively efficient since one only needs to iterate                                                                                                                                                                         
over the number distinct length values observed in the sequence set.                                                                                                                                                                          
                                                                                            
For instance, if fq is a ShortReadQ object generated by readFastq from ShortRead                                                                                                                                                              
then one can obtain the quality (q) and base (s) matrices like this: 
            
mymin <- min(width(fq))                                                                                                                                                                                                                       
mymax <- max(width(fq))                                                                                                                                                                                                                       
s <- matrix("N", length(fq), mymax)                                                                                                                                                                                                           
q <- matrix(NA, length(fq), mymax) # Or use 0 instead of NAs                                                                                                                                                                                  
for(i in mymin:mymax) {                                                                                                                                                                                                                       
     index <- width(fq)==i                                                                                                                                                                                                                    
     if(any(index)) {                                                                                                                                                                                                                         
        s[index, 1:i] <- as.matrix(DNAStringSet(sread(fq)[index], start=1, end=i))                                                                                                                                                            
        q[index, 1:i] <- as.matrix(PhredQuality(quality(fq))[index])                                                                                                                                                                          
     }                                                                                                                                                                                                                                        
}                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                              
When it comes to turning q and s back to XStringSet objects then things are                                                                                                                                                                   
not so efficient anymore. Perhaps there is an elegant/fast option to coerce a                                                                                                                                                                 
matrix of Phred scores back to a BStringSet object that I am not aware of?                                                                                                                                                                    
                                                                                                                                                                                                                                              
Thomas           

On Wed, Jun 27, 2012 at 06:26:25PM +0000, Martin Morgan wrote:
> On 06/27/2012 11:22 AM, Martin Morgan wrote:
> > On 06/27/2012 08:02 AM, Kasper Daniel Hansen wrote:
> >> Phred qualities are usually presented as ascii-encode numbers with an
> >> offset of either 32 or 64. Some packages returns this as a
> >> BStringSet. I can convert a character vector "charvec" to a list of
> >> integers using code like
> >> sapply(charvec, function(xx) charToRaw(xx) - 33L)
> >>
> >> Do we have fast(er) ways of doing this, when charvec is really long
> >> and not necessarily with the same number of chars in each string? I
> >> am thinking of implementing the sapply() above in C (directly
> >> vectorizing it), but surely someone has done something like that
> >> somewhere.
> >
> > I think you get this with XStringSet, e.g., PhredQuality, with
> >
> > x = PhredQuality(c("HH", "III"))
> > y = as.numeric(unlist(x)) - 33L
> 
>    as.integer
> 
> > z = relist(y, x)
> 
> or for a simple list
> 
>    split(y, rep(seq_along(x), elementLengths(x))
> 
> I have a recollection that there is something built-in...
> 
> Martin
> 
> >
> > Martin
> >
> >>
> >> Kasper
> >>
> >> _______________________________________________
> >> Bioc-devel at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >
> >
> 
> 
> -- 
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
> 
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel



More information about the Bioc-devel mailing list