[Bioc-devel] phred qualities
Thomas Girke
thomas.girke at ucr.edu
Wed Jun 27 22:14:17 CEST 2012
I am not sure if this will help achieving what you are trying to do, but when I
am dealing with reads that are not of constant length and I want to perform
efficient matrix-based computations on their phred scores or base calls then I
inject them into a matrix pre-populated with N/NA values and of dimensions
N_rows = number_of_reads and N_columns = length_of_longest_read. Obtaining
such a matrix is relatively efficient since one only needs to iterate
over the number distinct length values observed in the sequence set.
For instance, if fq is a ShortReadQ object generated by readFastq from ShortRead
then one can obtain the quality (q) and base (s) matrices like this:
mymin <- min(width(fq))
mymax <- max(width(fq))
s <- matrix("N", length(fq), mymax)
q <- matrix(NA, length(fq), mymax) # Or use 0 instead of NAs
for(i in mymin:mymax) {
index <- width(fq)==i
if(any(index)) {
s[index, 1:i] <- as.matrix(DNAStringSet(sread(fq)[index], start=1, end=i))
q[index, 1:i] <- as.matrix(PhredQuality(quality(fq))[index])
}
}
When it comes to turning q and s back to XStringSet objects then things are
not so efficient anymore. Perhaps there is an elegant/fast option to coerce a
matrix of Phred scores back to a BStringSet object that I am not aware of?
Thomas
On Wed, Jun 27, 2012 at 06:26:25PM +0000, Martin Morgan wrote:
> On 06/27/2012 11:22 AM, Martin Morgan wrote:
> > On 06/27/2012 08:02 AM, Kasper Daniel Hansen wrote:
> >> Phred qualities are usually presented as ascii-encode numbers with an
> >> offset of either 32 or 64. Some packages returns this as a
> >> BStringSet. I can convert a character vector "charvec" to a list of
> >> integers using code like
> >> sapply(charvec, function(xx) charToRaw(xx) - 33L)
> >>
> >> Do we have fast(er) ways of doing this, when charvec is really long
> >> and not necessarily with the same number of chars in each string? I
> >> am thinking of implementing the sapply() above in C (directly
> >> vectorizing it), but surely someone has done something like that
> >> somewhere.
> >
> > I think you get this with XStringSet, e.g., PhredQuality, with
> >
> > x = PhredQuality(c("HH", "III"))
> > y = as.numeric(unlist(x)) - 33L
>
> as.integer
>
> > z = relist(y, x)
>
> or for a simple list
>
> split(y, rep(seq_along(x), elementLengths(x))
>
> I have a recollection that there is something built-in...
>
> Martin
>
> >
> > Martin
> >
> >>
> >> Kasper
> >>
> >> _______________________________________________
> >> Bioc-devel at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >
> >
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
More information about the Bioc-devel
mailing list