[BioC] pairwiseAlignment generates different outcomes for the same inp...

Tue Mar 15 18:58:29 CET 2011

On 03/14/2011 12:15 PM, Alogmail2 at aol.com wrote:
> Hi Martin,
>
> Thanks for the clarification. Does the variant of   pattern="DNAString",
> subject="DNAString"
>
> look more realistic to  align DNA sequences?

The method for DNAString pattern and subject is definitely the one you
should use to align DNA sequences.

>
> One more question.
>
> What pairwise alignment tool would you recommend if I need to align a
> list#1 of 50-mer probes with a list#2 of more longer contigs (hundreds of
> bases):
>
> I take each 50-mer probe from the list#1 and align it to each contig  from
> the list#2
>
> Finally, for each 50-mer probe I need to find the best match in some terms
> e.g. in terms of alignment scores.
>
> I have tried pairwiseAlignment() but it works slowly and looks unfeasible
> for the list# 1 of 16798 probes vs. the list#2 of 14338 contigs, i.e. when
> output matrix should be 16798 by 14338.

Taking advantage of the fact that pairwiseAlignment() is vectorized
with respect to the pattern:

   library(drosophila2probe)
   probes <- DNAStringSet(drosophila2probe)  # 25-mer probes only
   library(BSgenome.Dmelanogaster.UCSC.dm3)
   contig1 <- subseq(unmasked(Dmelanogaster$chr2L), end=1000)

Then:

   > system.time(pa1 <- pairwiseAlignment(probes[1:16798], contig1, 
type="global-local"))
      user  system elapsed
    34.386   0.172  36.380

Which translates to 6 days for 14338 contigs. Slow but not unfeasible.

>
> I have looked at the mailing list and found suggestions to apply
> matchPattern/vmatchPattern but they provide the fact of a match for a  substring
> within a string rather than alignment score to compare  alignments.

Right, they don't provide a score. But a workaround is to proceed
in 2 passes: first use vmatchPattern with max.mismatch set to the max
nb of indels you are willing to accept, and then compute your own score
for example by computing the "edit distance" between the pattern and
their matches (you can try to use neditStartingAt() or stringDist()
for this).

Cheers,
H.

>
> Thanks
>
> Best regards,
>
> Alex
>
>
>
> In a message dated 3/14/2011 9:04:04 A.M. Pacific Daylight Time,
> mtmorgan at fhcrc.org writes:
>
> On  03/13/2011 06:58 PM, Alogmail2 at aol.com wrote:
>> Dear Mailing  List,
>>
>> Why does pairwiseAlignment() generate different outcomes  for the same
> input
>>    sequences defined differently in terms  of classes (see showMethods
> below):
>>
>> for
>>
>>   pattern="character", subject="character"
>
> Hi Alex --
>
> here  'character' could be anything -- 'the quick brown fox',  whereas
>
>>
>> vs.
>>
>> pattern="DNAString",  subject="DNAString"
>
> here pattern and subject are drawn from a  restricted alphabet and some
> symbols have particular meaning (e.g., 'N'  does not mean the nucleotide
> 'N').
>
> Martin
>
>>
>>   ?
>>
>> It generates the same outcomes for the case  of
>>
>>
>> pattern="character",  subject="character"
>>
>> vs.
>>
>> pattern="character",  subject="DNAString"
>>
>>
>> It looks like a  bug.
>>
>> Thanks
>>
>>   Alex
>>
>>
>>
>>
>>
>>
>>   #showMethods("pairwiseAlignment")
>> #Function: pairwiseAlignment  (package Biostrings)
>> #pattern="character",   subject="character"
>> #pattern="character",   subject="DNAString"
>> #    (inherited from:  pattern="character",  subject="XString")
>> #pattern="DNAString",  subject="DNAString"
>>
>>> pattern.1
>> 50-letter  "DNAString" instance
>> seq:   CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
>>>   pattern.2
>> [1]   "CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC"
>>>   subject.1
>> 543-letter "DNAString" instance
>> seq:
>>
> AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTTCAGATT...CAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGCCAGTAGTTACG
>>   CGTCGT
>>>    subject.2
>> [1]
>>
> "AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTTCAGATTCGCCCTTCGTCAGCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCGGCCTACGCTGCTTCCTATTAC
>>
> AGCAGCGCGATGCAGGCCTACAATAGTCAATCGACGTCGGCCTACATGCCAAGCAGTGGATTCTATAATGGCGCAT
>>
> CTTCGCAGACGCCCTACGGAGTCCTGGCGCCCTCCACTTACACAACGATGGGCGTTCCCAGTACAAGAGGTTTAGG
>>
> CCAACAATGTAAAAATGGACAATCATTAGCACAAACGCCTCCGTATTTGAGCTCGTACGGGTCGGCATTCGGTGGT
>>
> GTCACAGCCAGCAGTTCGCCTTCGGGTCCACCCGCCTACGCGTCCGCTTATGGATCGGCATACAATAGCGCCACCG
>>
> CCGCCCAATCGTTCACCAACAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGC
>>   CAGTAGTTACGCGTCGT"
>>>
>>
>>>
>>
> pairwiseAlignment(pattern=pattern.1,subject=subject.1,type="global-local")
>>   Global-Local  PairwiseAlignedFixedSubject (1 of 1)
>>   pattern:   [1]   CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
>> subject:  [429]  TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC
>> score:  -91.50367
>>>
>>
> pairwiseAlignment(pattern=pattern.2,subject=subject.2,type="global-local")
>>   Global-Local  PairwiseAlignedFixedSubject (1 of 1)
>> pattern:   [1]  CTGC--CATGGCAAAGCTC--GCTGCC-TCAGAGGCCGCCAC-AATGGTTGCGCAC
>>   subject: [80]   CTTCGTCA--GCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCG--GCCTAC
>> score:  95.69296
>>>
>>
> pairwiseAlignment(pattern=pattern.2,subject=subject.1,type="global-local")
>>   Global-Local  PairwiseAlignedFixedSubject (1 of 1)
>>   pattern:   [1]   CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
>> subject:  [429]  TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC
>> score:  -91.50367
>>
>>> sessionInfo()
>> R version 2.12.2  (2011-02-25)
>> Platform:  i386-pc-mingw32/i386  (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United  States.1252   LC_CTYPE=English_United
>> States.1252   LC_MONETARY=English_United  States.1252
>> [4]   LC_NUMERIC=C                   LC_TIME=English_United
>>   States.1252
>>
>> attached base packages:
>> [1] stats   graphics   grDevices utils      datasets  methods    base
>>
>> other attached  packages:
>> [1] altcdfenvs_2.12.0 hypergraph_1.22.0   graph_1.28.0
>> makecdfenv_1.28.0  affy_1.28.0     Biobase_2.10.0     GeneR_2.20.0
> seqinr_3.0-1
>> [9] Biostrings_2.18.4 IRanges_1.8.9     limma_3.6.9
>>
>> loaded via a namespace (and not  attached):
>> [1]  affyio_1.18.0           preprocessCore_1.12.0   tools_2.12.2
>>
>>
>>
>>
>>   ###############################
>>
>> Nutritional Sciences and  Toxicology,
>> 119 Morgan Hall
>> UC.Berkeley,CA  94720
>>
>>
>>      [[alternative HTML version  deleted]]
>>
>>   _______________________________________________
>> Bioconductor mailing  list
>> Bioconductor at r-project.org
>>   https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the  archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100  Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location:  M1-B861
> Telephone: 206  667-2793
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319