[BioC] Calculating alignment scores from aligned sequences

Fri Apr 27 22:40:11 CEST 2012

Hi Steve, Michael,

On 04/17/2012 04:08 PM, Michael Lawrence wrote:
> Hi Steve,
>
> I doubt there is anything in Biostrings for this, as it is an unusual case
> to score an alignment after generating the alignment, since the scoring and
> the alignment are intertwined in the algorithm.

I just had a look at the C code in Biostrings in charge of computing
the score (src/align_pairwiseAlignment.c file, pairwiseAlignment
function starting at line 90) and what I see confirms what Michael
is saying: the scoring happens at the same time the alignment itself
is computed. I guess that's just a consequence of the "dynamic
programming" nature of the Smith-Waterman algo.

A separate C function for computing the score of 2 already aligned
sequences might not be too hard to implement but I suspect it won't
reuse much of the existing code, just because the problem it tries to
solve is different. It's also much simpler problem.

Actually the problem might be easy to solve with pure R code (and
in a way that might be efficient enough if you have a small number
of aligned sequences).

FWIW, here is a simple (and limited) solution that deals with DNA
input only, "globally" aligned sequences only, where the scoring
is defined by a substitution matrix and the gap opening and gap
extension penalties only (no 'patternQuality', 'subjectQuality'
or 'fuzzyMatrix' args):

scoreTwoAlignedSequences <- function(pattern, subject,
                                      substitutionMatrix,
                                      gapOpening=-10, gapExtension=-4)
{
     if (!is(pattern, "DNAString") || !is(subject, "DNAString") ||
         length(pattern) != length(subject))
         stop("'pattern' and 'subject' must be DNAString objects ",
              "of the same length")
     if (!is.matrix(substitutionMatrix))
         stop("'substitutionMatrix' must be a matrix")
     pletters <- strsplit(as.character(pattern), split=NULL)[[1L]]
     sletters <- strsplit(as.character(subject), split=NULL)[[1L]]
     is_indel <- pletters == "-" | sletters =="-"
     ## Subsetting the substitutionMatrix matrix by a 2-column matrix
     ## (a little known feature of [ very useful here):
     ii <- cbind(pletters[!is_indel], sletters[!is_indel])
     score <- sum(substitutionMatrix[ii])
     prle <- Rle(pletters)
     pgaps_runs <- which(runValue(prle) == "-")
     score <- score + length(pgaps_runs) * gapOpening +
                      sum(runLength(prle)[pgaps_runs]) * gapExtension
     srle <- Rle(sletters)
     sgaps_runs <- which(runValue(srle) == "-")
     score <- score + length(sgaps_runs) * gapOpening +
                      sum(runLength(srle)[sgaps_runs]) * gapExtension
     score
}

Of course a serious candidate for inclusion to Biostrings would need
to be extended to support all the sort of input supported by
pairwiseAlignment()... and also use a better name (using a generic
+ methods would be better too)... and documented. Doesn't sound like
just a 1- or 2-hour project to me :-/

H.

> I can't speak for everyone
> else, but it could be a useful contribution to Biostrings.
>
> Michael
>
> On Tue, Apr 17, 2012 at 2:07 PM, Steve Lianoglou<
> mailinglist.honeypot at gmail.com>  wrote:
>
>> Hi,
>>
>> I thought I'd add to the pairwiseAlignment-related activity on the
>> list today, so here's a bit of navel gazing.
>>
>> I *think* I want to calculate alignment scores for subsets of already
>> aligned sequences and I don't think there's an easy and efficient way
>> to do this without having to realign the subsets of the strings I'm
>> interested in.
>>
>> For instance, say I have a pairwise alignment `pa`, and the result of
>> `compareStrings(pattern(pa), subject(pa))` looks like something like
>> this:
>>
>>
>> "GTA?TTT?A-----TTTCATATC?TGT?TC?------------------------------------------------------CAT"
>>
>> Given values for the substitutionMatrix, gapOpenening, and
>> gapExtension penalties used during the pairwiseAlignment, one could
>> (in principle), calculate the alignment score  of the first half vs.
>> the second half, or (say) running windows of length N along the
>> alignment.
>>
>> Like I said, I *think* this is what I want to do and I was just
>> curious if I'm not missing something in Biostrings that I can wire
>> into ... although it's straight-forward enough to write something
>> myself. In the gapOpening == gapExtension case, I can imagine
>> massaging the `compareString` to an integer vector and then doing a
>> simple matrix multiplication into an augmented substitutionMatrix, but
>> I think the more common gapOpening != gapExtension case suggests I
>> should really write a helper function in C(++).
>>
>> Thanks,
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>   | Memorial Sloan-Kettering Cancer Center
>>   | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319