[BioC] [R] Performing basic Multiple Sequence Alignment in R?

Tue Dec 21 13:44:54 CET 2010

I don't have an answer; I'm trying to solicit more input with some additional questions.

> From: tal.galili at gmail.com
> Date: Tue, 21 Dec 2010 11:21:03 +0200
> To: r-help at r-project.org; bioconductor at r-project.org
> Subject: [R] Performing basic Multiple Sequence Alignment in R?
>
> Hello everyone,
>
> I am not sure if this should go on the general R mailing list (for example,
> if there is a text mining solution that might work here) or the bioconductor
> mailing list (since I wasn't able to find a solution to my question on
> searching their lists) - so this time I tried both, and in the future I'll
> know better (in case it should go to only one of the two).
>

I take it you don't want an R interface to Clustal. I seem
to recall, from doing this a few years ago, that alignment by
exact string matching was a bit of a research area (I think you
can find papers on CiteSeer, for example). It does seem you are asking
about exact string matches as alignment markers, since your left sequences
appear exactly someplace on the right, but your overall interests
are not really clear. I never got my code fully working, but I was
happy that I could handle different strains of E. coli (or something in
the 5-10 Mbp genome range) very quickly (seconds, as I recall), and
you could presumably also find similar items that had
moved a long way.
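
To make the exact-match idea concrete, here is a minimal sketch (in Python
rather than R, only because it is quick to write out). It is not my old code
and not taken from any package. The function name shared_kmer_anchors and the
word length k are made up for illustration. It assumes that a k-letter word
occurring exactly once in each of two sequences can serve as an alignment
marker.

```python
from collections import defaultdict


def shared_kmer_anchors(a, b, k=8):
    """Return sorted (i, j) pairs with a[i:i+k] == b[j:j+k].

    Assumption for this sketch: only k-mers that occur exactly once in
    each sequence are kept, so every anchor is unambiguous.
    """
    def index_kmers(s):
        # Map each k-mer to the list of start positions where it occurs.
        positions = defaultdict(list)
        for i in range(len(s) - k + 1):
            positions[s[i:i + k]].append(i)
        return positions

    index_a = index_kmers(a)
    index_b = index_kmers(b)
    anchors = []
    for kmer, pos_a in index_a.items():
        pos_b = index_b.get(kmer, [])
        if len(pos_a) == 1 and len(pos_b) == 1:
            anchors.append((pos_a[0], pos_b[0]))
    return sorted(anchors)


if __name__ == "__main__":
    # Observed sequences 1 and 4 from the original post.
    seq1 = "CGCAATACTAACAGCTGACTTACGCACCG"
    seq4 = "CGCAATACTAACCACTAACTCGCTGCG"
    for i, j in shared_kmer_anchors(seq1, seq4, k=8):
        print(i, j, seq1[i:i + 8])
```

Anchors whose offsets i - j differ hint at an insertion or deletion between
them, and a large jump in offset would correspond to the "moved a long way"
case. Building the index takes time roughly linear in the sequence length,
which is why this kind of approach can be fast on whole genomes.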

Earlier, someone came here with a task and was pointed to the bio
packages, but I thought there might be something in computational
linguistics or text mining better suited to their needs. No one ever
volunteered anything.

>
> The task I'm trying to achieve is to align several sequences together.
> I don't have a basic pattern to match to. All that I know is that the
> "True" pattern should be of length "30" and that the sequences I'm looking
> at, have had missing values introduced to them at random points.

Alternatively, I guess someone could write an R interface to the various
BLAST programs. Sometimes the help desk at NCBI can get questions like this
to the right person internally.

> Here is an example of such sequences, where on the left we see what is the
> real location of the missing values, and on the right we see the sequence
> that we will be able to observe. My goal is to reconstruct the left column
> using only the sequences I've got on the right column (based on the fact
> that many of the letters in each position are the same)
>
>   Real_sequence                  The_sequence_we_see
> 1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
> 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG
> 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG
> 4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG
> 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
> 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG
> 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG
>
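
For the task as stated, the usual building block is a pairwise global
alignment, which progressive multiple aligners such as Clustal apply
repeatedly. Below is a minimal Needleman-Wunsch sketch in Python, offered
only as an illustration. The function name needleman_wunsch and the scores
(match=1, mismatch=-1, gap=-1) are arbitrary choices of mine, not anything
tuned for these data. It will not by itself recover the 30-column layout,
and two sequences alone may not be enough to decide where a gap truly
belongs.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (aligned_a, aligned_b, score).

    Assumption for this sketch: linear gap penalty and a flat
    match/mismatch score, chosen only for illustration.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


if __name__ == "__main__":
    # Observed sequences 1 and 2 from the table above.
    top, bottom, total = needleman_wunsch("CGCAATACTAACAGCTGACTTACGCACCG",
                                          "CGCAATACTAGCAGGTGACTTCCCTCG")
    print(top)
    print(bottom)
    print(total)
```

A multiple alignment could then be built by aligning each further sequence
against the growing set, with the known length of 30 used as a check on the
result.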