[BioC] Help about 'pairwiseAlignment' in Biostrings package
Hervé Pagès
hpages at fhcrc.org
Wed Mar 9 19:41:25 CET 2011
Hi LiGang,
Note that by default, pairwiseAlignment() performs "global" alignment,
not "local". However, the example you provided looks more like
a needle/haystack situation (short pattern that must be fully aligned
to a substring of a long subject), so it looks like the type of
alignment you are looking for is "global-local".
> library(Biostrings)
> set.seed(12345)
> toquery <- DNAString(paste(sample(c("A","G","C","T"),200,
rep=TRUE),collapse=""))
> toinsert <- DNAString('GCGGTGCCCGGGGCATCTCCGCGGCGGAAC')
> to.subject<-c(toquery, toinsert, subseq(toquery, start=30,
end=150),toinsert)
> pairwiseAlignment(subseq(toquery,start=50,end=120), to.subject,
type="global-local")
Global-Local PairwiseAlignedFixedSubject (1 of 1)
pattern: [1]
CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCAG
subject: [251]
CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCAG
score: 140.7047
Note that this result is *very* different from what you get when doing
"global" alignment:
> pairwiseAlignment(subseq(toquery,start=50,end=120), to.subject)
Global PairwiseAlignedFixedSubject (1 of 1)
pattern: [1]
CTT--------------------------------...TCTATTACCCCCAGGCTCCTCTGACGTGTTTCAG
subject: [1]
CTTTGAGCCTAACAGGGGATGGTCCGCCAGTAACG...TCTATTACCCCCAGGCTCCTCTGACGTGTTTCAG
score: -1119.295
This is because when doing "global" ("global" is short for "global-
global"), pairwiseAlignment() tries to align the pattern to the full
subject, so in your case (short pattern) it needs to "stretch" the
pattern a lot (by introducing gaps in it) to make it "fit" the subject.
Anyway, using "global-local" doesn't solve the problem that, when more
than one match achieves the best possible score, pairwiseAlignment()
only reports one of them (the last one I believe). This is a known
limitation of pairwiseAlignment().
Here are 2 possible workarounds/alternatives:
1. If you really need all the richness/flexibility of
the pairwiseAlignment() function (i.e. its ability to support
all kinds of scoring scheme thru the 'patternQuality',
'subjectQuality', 'substitutionMatrix', 'fuzzyMatrix',
'gapOpening' and 'gapExtension' arguments), then
you can call it again on the truncated subject to find any
other potential match:
> pairwiseAlignment(subseq(toquery,start=50,end=120),
subseq(to.subject, end=250), type="global-local")
Global-Local PairwiseAlignedFixedSubject (1 of 1)
pattern: [1]
CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCAG
subject: [50]
CTTGACGCAGATGTTTATACTCCGGACTCCATCAAAGTCTATTACCCCCAGGCTCCTCTGACGTGTTTCAG
score: 140.7047
You'll need to write your own function that uses a loop to truncate
and search again. At each step you'll want to look at the score of
the new match you get (the score should only go down) and make sure
that it's close enough to the score of the original match. You exit
the loop if it's not.
Note that this solution doesn't deal properly with the (rare)
situation where 2 best matches are overlapping. Also, it's
probably not going to be very efficient.
2. If you don't need all the richness/flexibility of
pairwiseAlignment(), and if it's ok for you to allow only a small
number of indels, then you could always use the matchPattern()
function:
> matchPattern(subseq(toquery,start=50,end=120), to.subject,
max.mismatch=10, with.indels=TRUE)
Views on a 381-letter DNAString subject
subject:
CTTTGAGCCTAACAGGGGATGGTCCGCCAGTAAC...TGTGGCGGTGCCCGGGGCATCTCCGCGGCGGAAC
views:
start end width
[1] 50 120 71
[CTTGACGCAGATGTTTATACTCCGGACT...CCCCCAGGCTCCTCTGACGTGTTTCAG]
[2] 251 321 71
[CTTGACGCAGATGTTTATACTCCGGACT...CCCCCAGGCTCCTCTGACGTGTTTCAG]
It's less flexible than pairwiseAlignment() but it returns all the
matches. It's also faster and more memory efficient. See
?matchPattern for the details.
Cheers,
H.
On 03/08/2011 10:05 PM, ligang wrote:
> Hello list,
>
> I'm using 'pairwiseAlignment' function of Biostrings package to do sequences
> alignment, and found that when do local alignment, if there were several
> fragments matched between 'pattern' and 'subject' sequence, only one was found
> by 'pairwiseAlignment' function, see the following example:
>
>
> #######################
>
> set.seed(12345)
> DNAString(paste(sample(c("A","G","C","T"),200, rep=TRUE),collapse=""))->toquery
> DNAString('GCGGTGCCCGGGGCATCTCCGCGGCGGAAC')->toinsert
> to.subject<-c(toquery, toinsert, subseq(toquery, start=30, end=150),toinsert)
> pairwiseAlignment(subseq(toquery,start=50,end=120), to.subject)
>
>
> #######################
>
> how could I found all the "matches"?
>
> LiGang
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list