[BioC] pairwiseAlignment generates different outcomes for the same input sequences
Pages, Herve
hpages at fhcrc.org
Tue Mar 15 08:08:19 CET 2011
Hi Alex,
For *quality-based* alignments, pairwiseAlignment() uses the
length of the underlying alphabet to compute the substitution
matrix.
So I did the following change in Biostrings 2.19.17 (devel): when the
input is stored in character vectors, BString or BStringSet objects
(i.e. not DNAString or AAString), pairwiseAlignment() will now consider
that the length of the underlying alphabet is the number of distinct
letters in the input. Not 256 anymore.
The reason for this change: Why 256? Why would we want the result of
an alignment to depend on the internal representation of characters?
If characters were represented on 16 bits (unicode) instead of 8 bits,
then this would change how "AAAA" aligns to "CCCAAGAATTT". Doesn't sound
like a good feature.
So when the input contains 4 distinct letters (like in your situation),
then having the input in DNAString objects or just ordinary character
vectors should always give the same result:
> pairwiseAlignment("AAAA", "CCCAAGAATTT")
Global PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] AAAA
subject: [4] AAGA
score: -47.95402
> pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"))
Global PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] AAAA
subject: [4] AAGA
score: -47.95402
And if the input contains a number of distinct letters close to 4
(i.e. 3 or 5), results should be close too:
> pairwiseAlignment("xxxx", "CCCxxGxxCCC")
Global PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] xxxx
subject: [4] xxGx
score: -49.02994
Biostrings 2.19.17 should become available thru biocLite() in the
next 24 hours or so.
Cheers,
H.
----- Original Message -----
From: "Hervé Pagès" <hpages at fhcrc.org>
To: Alogmail2 at aol.com
Cc: bioconductor at r-project.org
Sent: Monday, March 14, 2011 2:57:37 PM
Subject: Re: [BioC] pairwiseAlignment generates different outcomes for the same input sequences
Hi Alex,
Yes, I find it surprising too that you don't get the same thing.
With this simpler example:
> pairwiseAlignment("AAAA", "CCCAAGAATTT", type="global-local")
Global-Local PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] AA-AA
subject: [4] AAGAA
score: 17.92695
> pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"),
type="global-local")
Global-Local PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] AAAA
subject: [4] AAGA
score: 0.0459826
Note that I would *definititely* expect to get the same thing when
a 'substitutionMatrix' is provided. And this seems to be actually the
case (as long as all the letters in 'pattern' and 'subject' are
represented in the substitution matrix):
> mat <- nucleotideSubstitutionMatrix(match=1, mismatch=-3,
baseOnly=TRUE)
> mat
A C G T
A 1 -3 -3 -3
C -3 1 -3 -3
G -3 -3 1 -3
T -3 -3 -3 1
> pairwiseAlignment("AAAA", "CCCAAGAATTT", type="global-local",
substitutionMatrix=mat)
Global-Local PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] AAAA
subject: [5] AGAA
score: 0
> pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"),
type="global-local", substitutionMatrix=mat)
Global-Local PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] AAAA
subject: [5] AGAA
score: 0
However, by default (i.e. when no 'substitutionMatrix' is provided),
pairwiseAlignment() performs *quality-based* alignments. According
to the man page, this means that, internally, a substitution matrix
is generated (and used) based on the qualities of the letters in the
pattern and the subject (a constant default quality is assigned to
all the letters). It's unclear to me why the code generating this
substitution matrix would return different matrices depending on
whether 'pattern' and 'subject' are ordinary character vectors or
DNAString objects. It seems that the matrix generated for DNAString
objects makes it harder to see gaps in the pattern. Anyway, I need
to have a closer look at this before I can comment more.
Cheers,
H.
On 03/13/2011 06:58 PM, Alogmail2 at aol.com wrote:
> Dear Mailing List,
>
> Why does pairwiseAlignment() generate different outcomes for the same input
> sequences defined differently in terms of classes (see showMethods below):
>
> for
>
> pattern="character", subject="character"
>
> vs.
>
> pattern="DNAString", subject="DNAString"
>
> ?
>
> It generates the same outcomes for the case of
>
>
> pattern="character", subject="character"
>
> vs.
>
> pattern="character", subject="DNAString"
>
>
> It looks like a bug.
>
> Thanks
>
> Alex
>
>
>
>
>
>
> #showMethods("pairwiseAlignment")
> #Function: pairwiseAlignment (package Biostrings)
> #pattern="character", subject="character"
> #pattern="character", subject="DNAString"
> # (inherited from: pattern="character", subject="XString")
> #pattern="DNAString", subject="DNAString"
>
>> pattern.1
> 50-letter "DNAString" instance
> seq: CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
>> pattern.2
> [1] "CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC"
>> subject.1
> 543-letter "DNAString" instance
> seq:
> AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTTCAGATT...CAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGCCAGTAGTTACG
> CGTCGT
>> subject.2
> [1]
> "AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTTCAGATTCGCCCTTCGTCAGCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCGGCCTACGCTGCTTCCTATTAC
> AGCAGCGCGATGCAGGCCTACAATAGTCAATCGACGTCGGCCTACATGCCAAGCAGTGGATTCTATAATGGCGCAT
> CTTCGCAGACGCCCTACGGAGTCCTGGCGCCCTCCACTTACACAACGATGGGCGTTCCCAGTACAAGAGGTTTAGG
> CCAACAATGTAAAAATGGACAATCATTAGCACAAACGCCTCCGTATTTGAGCTCGTACGGGTCGGCATTCGGTGGT
> GTCACAGCCAGCAGTTCGCCTTCGGGTCCACCCGCCTACGCGTCCGCTTATGGATCGGCATACAATAGCGCCACCG
> CCGCCCAATCGTTCACCAACAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGC
> CAGTAGTTACGCGTCGT"
>>
>
>>
> pairwiseAlignment(pattern=pattern.1,subject=subject.1,type="global-local")
> Global-Local PairwiseAlignedFixedSubject (1 of 1)
> pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
> subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC
> score: -91.50367
>>
> pairwiseAlignment(pattern=pattern.2,subject=subject.2,type="global-local")
> Global-Local PairwiseAlignedFixedSubject (1 of 1)
> pattern: [1] CTGC--CATGGCAAAGCTC--GCTGCC-TCAGAGGCCGCCAC-AATGGTTGCGCAC
> subject: [80] CTTCGTCA--GCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCG--GCCTAC
> score: 95.69296
>>
> pairwiseAlignment(pattern=pattern.2,subject=subject.1,type="global-local")
> Global-Local PairwiseAlignedFixedSubject (1 of 1)
> pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
> subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC
> score: -91.50367
>
>> sessionInfo()
> R version 2.12.2 (2011-02-25)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
> States.1252 LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C LC_TIME=English_United
> States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] altcdfenvs_2.12.0 hypergraph_1.22.0 graph_1.28.0
> makecdfenv_1.28.0 affy_1.28.0 Biobase_2.10.0 GeneR_2.20.0 seqinr_3.0-1
> [9] Biostrings_2.18.4 IRanges_1.8.9 limma_3.6.9
>
> loaded via a namespace (and not attached):
> [1] affyio_1.18.0 preprocessCore_1.12.0 tools_2.12.2
>
>
>
>
> ###############################
>
> Nutritional Sciences and Toxicology,
> 119 Morgan Hall
> UC.Berkeley,CA 94720
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list