[BioC] pairwiseAlignment generates different outcomes for the same inp...
Hervé Pagès
hpages at fhcrc.org
Tue Mar 15 18:58:29 CET 2011
On 03/14/2011 12:15 PM, Alogmail2 at aol.com wrote:
> Hi Martin,
>
> Thanks for the clarification. Does the variant of pattern="DNAString",
> subject="DNAString"
>
> look more realistic to align DNA sequences?
The method for DNAString pattern and subject is definitely the one you
should use to align DNA sequences.
>
> One more question.
>
> What pairwise alignment tool would you recommend if I need to align a
> list#1 of 50-mer probes with a list#2 of more longer contigs (hundreds of
> bases):
>
> I take each 50-mer probe from the list#1 and align it to each contig from
> the list#2
>
> Finally, for each 50-mer probe I need to find the best match in some terms
> e.g. in terms of alignment scores.
>
> I have tried pairwiseAlignment() but it works slowly and looks unfeasible
> for the list# 1 of 16798 probes vs. the list#2 of 14338 contigs, i.e. when
> output matrix should be 16798 by 14338.
Taking advantage of the fact that pairwiseAlignment() is vectorized
with respect to the pattern:
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe) # 25-mer probes only
library(BSgenome.Dmelanogaster.UCSC.dm3)
contig1 <- subseq(unmasked(Dmelanogaster$chr2L), end=1000)
Then:
> system.time(pa1 <- pairwiseAlignment(probes[1:16798], contig1,
type="global-local"))
user system elapsed
34.386 0.172 36.380
Which translates to 6 days for 14338 contigs. Slow but not unfeasible.
>
> I have looked at the mailing list and found suggestions to apply
> matchPattern/vmatchPattern but they provide the fact of a match for a substring
> within a string rather than alignment score to compare alignments.
Right, they don't provide a score. But a workaround is to proceed
in 2 passes: first use vmatchPattern with max.mismatch set to the max
nb of indels you are willing to accept, and then compute your own score
for example by computing the "edit distance" between the pattern and
their matches (you can try to use neditStartingAt() or stringDist()
for this).
Cheers,
H.
>
> Thanks
>
> Best regards,
>
> Alex
>
>
>
> In a message dated 3/14/2011 9:04:04 A.M. Pacific Daylight Time,
> mtmorgan at fhcrc.org writes:
>
> On 03/13/2011 06:58 PM, Alogmail2 at aol.com wrote:
>> Dear Mailing List,
>>
>> Why does pairwiseAlignment() generate different outcomes for the same
> input
>> sequences defined differently in terms of classes (see showMethods
> below):
>>
>> for
>>
>> pattern="character", subject="character"
>
> Hi Alex --
>
> here 'character' could be anything -- 'the quick brown fox', whereas
>
>>
>> vs.
>>
>> pattern="DNAString", subject="DNAString"
>
> here pattern and subject are drawn from a restricted alphabet and some
> symbols have particular meaning (e.g., 'N' does not mean the nucleotide
> 'N').
>
> Martin
>
>>
>> ?
>>
>> It generates the same outcomes for the case of
>>
>>
>> pattern="character", subject="character"
>>
>> vs.
>>
>> pattern="character", subject="DNAString"
>>
>>
>> It looks like a bug.
>>
>> Thanks
>>
>> Alex
>>
>>
>>
>>
>>
>>
>> #showMethods("pairwiseAlignment")
>> #Function: pairwiseAlignment (package Biostrings)
>> #pattern="character", subject="character"
>> #pattern="character", subject="DNAString"
>> # (inherited from: pattern="character", subject="XString")
>> #pattern="DNAString", subject="DNAString"
>>
>>> pattern.1
>> 50-letter "DNAString" instance
>> seq: CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
>>> pattern.2
>> [1] "CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC"
>>> subject.1
>> 543-letter "DNAString" instance
>> seq:
>>
> AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTTCAGATT...CAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGCCAGTAGTTACG
>> CGTCGT
>>> subject.2
>> [1]
>>
> "AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTTCAGATTCGCCCTTCGTCAGCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCGGCCTACGCTGCTTCCTATTAC
>>
> AGCAGCGCGATGCAGGCCTACAATAGTCAATCGACGTCGGCCTACATGCCAAGCAGTGGATTCTATAATGGCGCAT
>>
> CTTCGCAGACGCCCTACGGAGTCCTGGCGCCCTCCACTTACACAACGATGGGCGTTCCCAGTACAAGAGGTTTAGG
>>
> CCAACAATGTAAAAATGGACAATCATTAGCACAAACGCCTCCGTATTTGAGCTCGTACGGGTCGGCATTCGGTGGT
>>
> GTCACAGCCAGCAGTTCGCCTTCGGGTCCACCCGCCTACGCGTCCGCTTATGGATCGGCATACAATAGCGCCACCG
>>
> CCGCCCAATCGTTCACCAACAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGC
>> CAGTAGTTACGCGTCGT"
>>>
>>
>>>
>>
> pairwiseAlignment(pattern=pattern.1,subject=subject.1,type="global-local")
>> Global-Local PairwiseAlignedFixedSubject (1 of 1)
>> pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
>> subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC
>> score: -91.50367
>>>
>>
> pairwiseAlignment(pattern=pattern.2,subject=subject.2,type="global-local")
>> Global-Local PairwiseAlignedFixedSubject (1 of 1)
>> pattern: [1] CTGC--CATGGCAAAGCTC--GCTGCC-TCAGAGGCCGCCAC-AATGGTTGCGCAC
>> subject: [80] CTTCGTCA--GCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCG--GCCTAC
>> score: 95.69296
>>>
>>
> pairwiseAlignment(pattern=pattern.2,subject=subject.1,type="global-local")
>> Global-Local PairwiseAlignedFixedSubject (1 of 1)
>> pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC
>> subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC
>> score: -91.50367
>>
>>> sessionInfo()
>> R version 2.12.2 (2011-02-25)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
>> States.1252 LC_MONETARY=English_United States.1252
>> [4] LC_NUMERIC=C LC_TIME=English_United
>> States.1252
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] altcdfenvs_2.12.0 hypergraph_1.22.0 graph_1.28.0
>> makecdfenv_1.28.0 affy_1.28.0 Biobase_2.10.0 GeneR_2.20.0
> seqinr_3.0-1
>> [9] Biostrings_2.18.4 IRanges_1.8.9 limma_3.6.9
>>
>> loaded via a namespace (and not attached):
>> [1] affyio_1.18.0 preprocessCore_1.12.0 tools_2.12.2
>>
>>
>>
>>
>> ###############################
>>
>> Nutritional Sciences and Toxicology,
>> 119 Morgan Hall
>> UC.Berkeley,CA 94720
>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list