[BioC] pairwiseAlignment generates different outcomes for the same input sequences

Pages, Herve hpages at fhcrc.org
Tue Mar 15 08:08:19 CET 2011


Hi Alex,

For *quality-based* alignments, pairwiseAlignment() uses the
length of the underlying alphabet to compute the substitution
matrix.

So I did the following change in Biostrings 2.19.17 (devel): when the
input is stored in character vectors, BString or BStringSet objects
(i.e. not DNAString or AAString), pairwiseAlignment() will now consider
that the length of the underlying alphabet is the number of distinct
letters in the input. Not 256 anymore.

The reason for this change: Why 256? Why would we want the result of
an alignment to depend on the internal representation of characters?
If characters were represented on 16 bits (unicode) instead of 8 bits,
then this would change how "AAAA" aligns to "CCCAAGAATTT". Doesn't sound
like a good feature.

So when the input contains 4 distinct letters (like in your situation),
then having the input in DNAString objects or just ordinary character
vectors should always give the same result:

  > pairwiseAlignment("AAAA", "CCCAAGAATTT")
  Global PairwiseAlignedFixedSubject (1 of 1)
  pattern: [1] AAAA 
  subject: [4] AAGA 
  score: -47.95402 

  > pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"))
  Global PairwiseAlignedFixedSubject (1 of 1)
  pattern: [1] AAAA 
  subject: [4] AAGA 
  score: -47.95402 

And if the input contains a number of distinct letters close to 4
(i.e. 3 or 5), results should be close too:

  > pairwiseAlignment("xxxx", "CCCxxGxxCCC")
  Global PairwiseAlignedFixedSubject (1 of 1)
  pattern: [1] xxxx 
  subject: [4] xxGx 
  score: -49.02994 

Biostrings 2.19.17 should become available thru biocLite() in the
next 24 hours or so.

Cheers,
H.


----- Original Message -----
From: "Hervé Pagès" <hpages at fhcrc.org>
To: Alogmail2 at aol.com
Cc: bioconductor at r-project.org
Sent: Monday, March 14, 2011 2:57:37 PM
Subject: Re: [BioC] pairwiseAlignment generates different outcomes for the same	input sequences

Hi Alex,

Yes, I find it surprising too that you don't get the same thing.

With this simpler example:

   > pairwiseAlignment("AAAA", "CCCAAGAATTT", type="global-local")
   Global-Local PairwiseAlignedFixedSubject (1 of 1)
   pattern: [1] AA-AA
   subject: [4] AAGAA
   score: 17.92695

   > pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"), 
type="global-local")
   Global-Local PairwiseAlignedFixedSubject (1 of 1)
   pattern: [1] AAAA
   subject: [4] AAGA
   score: 0.0459826

Note that I would *definititely* expect to get the same thing when
a 'substitutionMatrix' is provided. And this seems to be actually the
case (as long as all the letters in 'pattern' and 'subject' are
represented in the substitution matrix):

   > mat <- nucleotideSubstitutionMatrix(match=1, mismatch=-3, 
baseOnly=TRUE)
   > mat
      A  C  G  T
   A  1 -3 -3 -3
   C -3  1 -3 -3
   G -3 -3  1 -3
   T -3 -3 -3  1

   > pairwiseAlignment("AAAA", "CCCAAGAATTT", type="global-local", 
substitutionMatrix=mat)
   Global-Local PairwiseAlignedFixedSubject (1 of 1)
   pattern: [1] AAAA
   subject: [5] AGAA
   score: 0

   > pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"), 
type="global-local", substitutionMatrix=mat)
   Global-Local PairwiseAlignedFixedSubject (1 of 1)
   pattern: [1] AAAA
   subject: [5] AGAA
   score: 0

However, by default (i.e. when no 'substitutionMatrix' is provided),
pairwiseAlignment() performs *quality-based* alignments. According
to the man page, this means that, internally, a substitution matrix
is generated (and used) based on the qualities of the letters in the
pattern and the subject (a constant default quality is assigned to
all the letters). It's unclear to me why the code generating this
substitution matrix would return different matrices depending on
whether 'pattern' and 'subject' are ordinary character vectors or
DNAString objects. It seems that the matrix generated for DNAString
objects makes it harder to see gaps in the pattern. Anyway, I need
to have a closer look at this before I can comment more.

Cheers,
H.


On 03/13/2011 06:58 PM, Alogmail2 at aol.com wrote:
> Dear Mailing List,
>
> Why does pairwiseAlignment() generate different outcomes for the same input
>   sequences defined differently in terms of classes (see showMethods below):
>
> for
>
> pattern="character", subject="character"
>
> vs.
>
> pattern="DNAString", subject="DNAString"
>
> ?
>
> It generates the same outcomes for the case of
>
>
> pattern="character", subject="character"
>
> vs.
>
> pattern="character", subject="DNAString"
>
>
> It looks like a bug.
>
> Thanks
>
> Alex
>
>
>
>
>
>
> #showMethods("pairwiseAlignment")
> #Function: pairwiseAlignment (package Biostrings)
> #pattern="character",  subject="character"
> #pattern="character",  subject="DNAString"
> #    (inherited from: pattern="character",  subject="XString")
> #pattern="DNAString", subject="DNAString"
>
>> pattern.1
> 50-letter "DNAString" instance
> seq:  CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
>> pattern.2
> [1]  "CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC"
>> subject.1
> 543-letter "DNAString" instance
> seq:
> AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTTCAGATT...CAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGCCAGTAGTTACG
> CGTCGT
>>   subject.2
> [1]
> "AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTTCAGATTCGCCCTTCGTCAGCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCGGCCTACGCTGCTTCCTATTAC
> AGCAGCGCGATGCAGGCCTACAATAGTCAATCGACGTCGGCCTACATGCCAAGCAGTGGATTCTATAATGGCGCAT
> CTTCGCAGACGCCCTACGGAGTCCTGGCGCCCTCCACTTACACAACGATGGGCGTTCCCAGTACAAGAGGTTTAGG
> CCAACAATGTAAAAATGGACAATCATTAGCACAAACGCCTCCGTATTTGAGCTCGTACGGGTCGGCATTCGGTGGT
> GTCACAGCCAGCAGTTCGCCTTCGGGTCCACCCGCCTACGCGTCCGCTTATGGATCGGCATACAATAGCGCCACCG
> CCGCCCAATCGTTCACCAACAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGC
> CAGTAGTTACGCGTCGT"
>>
>
>>
> pairwiseAlignment(pattern=pattern.1,subject=subject.1,type="global-local")
> Global-Local  PairwiseAlignedFixedSubject (1 of 1)
> pattern:   [1]  CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
> subject: [429]  TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC
> score: -91.50367
>>
> pairwiseAlignment(pattern=pattern.2,subject=subject.2,type="global-local")
> Global-Local  PairwiseAlignedFixedSubject (1 of 1)
> pattern:  [1]  CTGC--CATGGCAAAGCTC--GCTGCC-TCAGAGGCCGCCAC-AATGGTTGCGCAC
> subject: [80]  CTTCGTCA--GCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCG--GCCTAC
> score: 95.69296
>>
> pairwiseAlignment(pattern=pattern.2,subject=subject.1,type="global-local")
> Global-Local  PairwiseAlignedFixedSubject (1 of 1)
> pattern:   [1]  CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
> subject: [429]  TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC
> score: -91.50367
>
>> sessionInfo()
> R version 2.12.2 (2011-02-25)
> Platform:  i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252   LC_CTYPE=English_United
> States.1252    LC_MONETARY=English_United  States.1252
> [4]  LC_NUMERIC=C                            LC_TIME=English_United
> States.1252
>
> attached base packages:
> [1] stats     graphics   grDevices utils     datasets  methods    base
>
> other attached packages:
> [1] altcdfenvs_2.12.0 hypergraph_1.22.0  graph_1.28.0
> makecdfenv_1.28.0  affy_1.28.0       Biobase_2.10.0     GeneR_2.20.0      seqinr_3.0-1
> [9] Biostrings_2.18.4 IRanges_1.8.9      limma_3.6.9
>
> loaded via a namespace (and not attached):
> [1]  affyio_1.18.0          preprocessCore_1.12.0  tools_2.12.2
>
>
>
>
> ###############################
>
> Nutritional Sciences and Toxicology,
> 119 Morgan Hall
> UC.Berkeley,CA 94720
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list