[BioC] homopolymer gap extension penalty in Biostrings

Wed Mar 7 02:43:32 CET 2012

Hi Frederick,

Let's please use the "Reply All" button so the discussion stays on the
mailing list.

On 03/06/2012 05:12 PM, Frederick Ross wrote:
> On 3/6/12 4:56 PM, Hervé Pagès wrote:
>> Just to clarify, why are you saying "in the opposite strand" when you
>> say:
>>
>> specify a lower penalty for a gap extension when there is a
>> homopolymeric insertion in the opposite strand
>>
>> Could you provide an example?
>
> Perhaps I can provide some context: the basic unit of pyrosequencing is
> a homopolymer. I'm working with some preliminary data from Ion Torrent's
> machine, and the primary (and I mean by an order of magnitude) error
> rate is miscounting the length of homopolymers, often by large fractions
> (measuring 10 bases in place of 2, for instance).
>
> If you set the gap penalties low enough to handle this, you break
> alignment entirely since every mismatch and real gap goes all over the
> place. If you set the gap penalties high enough, the aligner breaks again.
>
> I think the real solution is to work in homopolymers. I've been
> homoencoding my sequencings, so AATCCGTTTA becomes A2 T1 C2 G1 T3 A1. In
> this representation, I can write down an appropriate scoring function
> that accounts for both base difference and length difference in
> homopolymer, i.e.,
>
> s(base1,length1,base2,length2) =
> v*abs(length1-length2)
> + if length1==length2 then 0 else a
> + if base1==base2 then -k else k
>
> However, I don't know of a general enough implementation of
> Smith-Waterman that I can hand a generalized function like this, along
> with a generalized alphabet.

I see. Those extra details are useful. Using a Run Length Encoding
representation of the reads is an interesting idea. FWIW, the IRanges
package implements the Rle class. Any regular atomic vector can
be turned into an Rle object. For example:

 > library(IRanges)
 > rledna <- Rle(strsplit("AATCCGTTTA", "")[[1]])
 > rledna
'character' Rle of length 10 with 6 runs
   Lengths:   2   1   2   1   3   1
   Values : "A" "T" "C" "G" "T" "A"
 > runLength(rledna)
[1] 2 1 2 1 3 1
 > runValue(rledna)
[1] "A" "T" "C" "G" "T" "A"

Instead of trying to align seq1 to seq2, would it make sense to try
to align runvals1 to runvals2 instead? Where runvals1 to runvals2 are
defined as:

   rledna1 <- Rle(strsplit(seq1, "")[[1]])
   runvals1 <- paste(runValue(rledna1), collapse="")

   rledna2 <- Rle(strsplit(seq2, "")[[1]])
   runvals2 <- paste(runValue(rledna2), collapse="")

Once you've aligned runvals1 to runvals2 you can infer the corresponding
alignment between seq1 and seq2 and infer its score (by now looking at
the lengths of the runs). It would be kind of like doing
Needleman-Wunsch or Smith-Waterman but where the runs are the
unbreakable units, not the individual letters, i.e.
insertions, deletions and replacements are allowed for entire
runs.

Anyway, I'm still not sure why "in the opposite strand" in

   specify a lower penalty for a gap extension when there is a
   homopolymeric insertion *in the opposite strand*

Also, I'm still not clear about this:

   Another thing that would be useful would be per-strand gap creation
   and gap extension penalties.

Thanks,
H.

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319