[BioC] pairwiseAlignment generates different outcomes for the same input sequences
Hervé Pagès
hpages at fhcrc.org
Mon Mar 14 22:57:37 CET 2011
Hi Alex,
Yes, I find it surprising too that you don't get the same thing.
With this simpler example:
> pairwiseAlignment("AAAA", "CCCAAGAATTT", type="global-local")
Global-Local PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] AA-AA
subject: [4] AAGAA
score: 17.92695
> pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"),
type="global-local")
Global-Local PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] AAAA
subject: [4] AAGA
score: 0.0459826
Note that I would *definititely* expect to get the same thing when
a 'substitutionMatrix' is provided. And this seems to be actually the
case (as long as all the letters in 'pattern' and 'subject' are
represented in the substitution matrix):
> mat <- nucleotideSubstitutionMatrix(match=1, mismatch=-3,
baseOnly=TRUE)
> mat
A C G T
A 1 -3 -3 -3
C -3 1 -3 -3
G -3 -3 1 -3
T -3 -3 -3 1
> pairwiseAlignment("AAAA", "CCCAAGAATTT", type="global-local",
substitutionMatrix=mat)
Global-Local PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] AAAA
subject: [5] AGAA
score: 0
> pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"),
type="global-local", substitutionMatrix=mat)
Global-Local PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] AAAA
subject: [5] AGAA
score: 0
However, by default (i.e. when no 'substitutionMatrix' is provided),
pairwiseAlignment() performs *quality-based* alignments. According
to the man page, this means that, internally, a substitution matrix
is generated (and used) based on the qualities of the letters in the
pattern and the subject (a constant default quality is assigned to
all the letters). It's unclear to me why the code generating this
substitution matrix would return different matrices depending on
whether 'pattern' and 'subject' are ordinary character vectors or
DNAString objects. It seems that the matrix generated for DNAString
objects makes it harder to see gaps in the pattern. Anyway, I need
to have a closer look at this before I can comment more.
Cheers,
H.
On 03/13/2011 06:58 PM, Alogmail2 at aol.com wrote:
> Dear Mailing List,
>
> Why does pairwiseAlignment() generate different outcomes for the same input
> sequences defined differently in terms of classes (see showMethods below):
>
> for
>
> pattern="character", subject="character"
>
> vs.
>
> pattern="DNAString", subject="DNAString"
>
> ?
>
> It generates the same outcomes for the case of
>
>
> pattern="character", subject="character"
>
> vs.
>
> pattern="character", subject="DNAString"
>
>
> It looks like a bug.
>
> Thanks
>
> Alex
>
>
>
>
>
>
> #showMethods("pairwiseAlignment")
> #Function: pairwiseAlignment (package Biostrings)
> #pattern="character", subject="character"
> #pattern="character", subject="DNAString"
> # (inherited from: pattern="character", subject="XString")
> #pattern="DNAString", subject="DNAString"
>
>> pattern.1
> 50-letter "DNAString" instance
> seq: CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
>> pattern.2
> [1] "CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC"
>> subject.1
> 543-letter "DNAString" instance
> seq:
> AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTTCAGATT...CAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGCCAGTAGTTACG
> CGTCGT
>> subject.2
> [1]
> "AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTTCAGATTCGCCCTTCGTCAGCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCGGCCTACGCTGCTTCCTATTAC
> AGCAGCGCGATGCAGGCCTACAATAGTCAATCGACGTCGGCCTACATGCCAAGCAGTGGATTCTATAATGGCGCAT
> CTTCGCAGACGCCCTACGGAGTCCTGGCGCCCTCCACTTACACAACGATGGGCGTTCCCAGTACAAGAGGTTTAGG
> CCAACAATGTAAAAATGGACAATCATTAGCACAAACGCCTCCGTATTTGAGCTCGTACGGGTCGGCATTCGGTGGT
> GTCACAGCCAGCAGTTCGCCTTCGGGTCCACCCGCCTACGCGTCCGCTTATGGATCGGCATACAATAGCGCCACCG
> CCGCCCAATCGTTCACCAACAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGC
> CAGTAGTTACGCGTCGT"
>>
>
>>
> pairwiseAlignment(pattern=pattern.1,subject=subject.1,type="global-local")
> Global-Local PairwiseAlignedFixedSubject (1 of 1)
> pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
> subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC
> score: -91.50367
>>
> pairwiseAlignment(pattern=pattern.2,subject=subject.2,type="global-local")
> Global-Local PairwiseAlignedFixedSubject (1 of 1)
> pattern: [1] CTGC--CATGGCAAAGCTC--GCTGCC-TCAGAGGCCGCCAC-AATGGTTGCGCAC
> subject: [80] CTTCGTCA--GCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCG--GCCTAC
> score: 95.69296
>>
> pairwiseAlignment(pattern=pattern.2,subject=subject.1,type="global-local")
> Global-Local PairwiseAlignedFixedSubject (1 of 1)
> pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
> subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC
> score: -91.50367
>
>> sessionInfo()
> R version 2.12.2 (2011-02-25)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
> States.1252 LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C LC_TIME=English_United
> States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] altcdfenvs_2.12.0 hypergraph_1.22.0 graph_1.28.0
> makecdfenv_1.28.0 affy_1.28.0 Biobase_2.10.0 GeneR_2.20.0 seqinr_3.0-1
> [9] Biostrings_2.18.4 IRanges_1.8.9 limma_3.6.9
>
> loaded via a namespace (and not attached):
> [1] affyio_1.18.0 preprocessCore_1.12.0 tools_2.12.2
>
>
>
>
> ###############################
>
> Nutritional Sciences and Toxicology,
> 119 Morgan Hall
> UC.Berkeley,CA 94720
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list