[BioC] pairwise alignment and homology score
Hervé Pagès
hpages at fhcrc.org
Thu Jan 12 20:00:20 CET 2012
Hi again,
On 01/11/2012 04:20 PM, Hervé Pagès wrote:
> Hi,
>
> On 01/10/2012 07:02 AM, erikafalisi at tin.it wrote:
>>
>> Dear List,
>> we are trying to compare more than 600 amino acid sequences in clustal
>> format (or FASTA format too) with BLOSUM62 matrix to achieve homology
>> score. We used pairwise alignment but we aren't able to compare our
>> sequences with BLOSUM matrix because pairwise command is used to match
>> only two strings (and not a set of sequences) with the matrix. How can
>> we iterate these command to perform all sequences alignments?
>
> Assuming you've managed to load your 600 sequences in a AAStringSet
> object 'myseqs', have you tried to do something along the lines of:
>
> scores <- matrix(integer(0), nrow=length(myseqs), ncol=length(myseqs))
> for (i in 1:length(myseqs))
> for (j in 1:i)
> scores[i, j] <- pairwiseAlignment(myseqs[i], myseqs[j],
> substitutionMatrix="BLOSUM62",
> scoreOnly=TRUE)
>
> One problem that arises is that the input of a clustering tool like
> hclust() is expected to be a 'dist' object i.e. a matrix of distances,
> not a matrix of scores. Since, unlike with the distance, the highest
> the score, the "closest" the sequences are, using scores as the input
> of hclust() is guaranteed to produce wrong results.
>
> I'm not sure what's the best way to transform pairwise scores into
> pairwise distances though (I'm not an expert on the topic) but it
> seems to me that the latter could be inferred from the former with
> something like:
>
> dist(x, y) <- 1 - score(x, y)^2 / (score(x, x) * score(y, y))
>
> The distance defined above looks like it could be close enough to a
> true distance (in the mathematical sense) but that's pure speculation
> since I don't really have a proof I could show you (the tricky part
> being to prove it satisfies the triangle inequality). Anyway, as far
> as hclust() is concerned, I'm not sure it matters. According to the
> man page, the first argument to hclust() only needs to be
> "a dissimilarity structure as produced by ‘dist’". Nothing is said
> about it being a true distance.
>
> So:
>
> dist <- 1 - scores^2 / (diag(scores) %*% t(diag(scores)))
An update on this. With the possibility of negative scores, the
transformation above from pairwise scores to pairwise (pseudo)distances
is not doing the right thing (giving the same result for a negative
score and its opposite).
Seems like a more accurate (and simpler) way to go is to just make the
(pseudo)distances (or dissimilarity measures) be the opposite of the
scores. On a concrete example:
myseqs <- AAStringSet(c("PAHHEAE", # cluster 1
"MGBB", # cluster 2
"PAHEAE", # cluster 1
"PAWHEAE", # cluster 1
"VGGDN", # cluster 2
"PAYHEAE", # cluster 1
"MGGBB", # cluster 2
"VGGBB", # cluster 2
"KMCEKR")) # cluster 3
scores <- matrix(integer(0), nrow=length(myseqs), ncol=length(myseqs))
for (i in 1:length(myseqs))
for (j in 1:i)
scores[i, j] <- pairwiseAlignment(myseqs[i], myseqs[j],
substitutionMatrix="BLOSUM62",
scoreOnly=TRUE)
Comparing different approaches:
a.
dist <- 1 - scores^2 / (diag(scores) %*% t(diag(scores)))
plot(hclust(as.dist(dist)))
Not good. In particular seqs 2 and 6 look very similar on the plot
(they are very close) but their pairwise score (scores[6, 2] is -25)
indicates that they are not.
b.
dist = -scores
plot(hclust(as.dist(dist)))
Much better. The plot reflects the expected clustering.
Note that a result similar to b. can be obtained by computing the
pairwise "edit distances" (aka Levenshtein distances):
c.
dist <- stringDist(myseqs, method="levenshtein")
plot(hclust(as.dist(dist)))
The clustering obtained with this method looks a lot like the one
obtained in b. but this method is slightly more efficient since
stringDist() takes care of the looping at the C level.
See ?stringDist for more details.
Finally be aware that stringDist() should not be used with 'method'
set to "quality" or "substitutionMatrix" since in that case the returned
object will contain the pairwise scores (i.e. high values mean similar
and low values mean dissimilar) and therefore is not suitable for
clustering. I'm planning to fix this by populating the returned object
with the opposites of the scores. In the mean time you can do:
d.
dist <- -stringDist(myseqs, method="substitutionMatrix",
substitutionMatrix=BLOSUM62,
gapOpening=-10, gapExtension=-4)
plot(hclust(dist))
This produces *exactly* the same result as b. but with slightly
more compact/efficient code.
Cheers,
H.
>
> Finally, plot the dendogram with:
>
> plot(hclust(as.dist(dist)))
>
>> We also performed Multiple Alignment but we obtain a cramped
>> dendogram. Is it possible to modify dendogram's sizes to make it
>> readable?
>
> What's a "cramped" dendogram? Not sure about modifying the size of the
> dendogram. I suggest you look at the man page for hclust() and in
> particular at the plot method for class 'hclust' (it has a lot of args).
> Others on the list might have better suggestions about producing
> nice dendogram plots.
>
> Cheers,
> H.
>
>> Thanks,
>> Dr Falisi
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list