[R] Performing basic Multiple Sequence Alignment in R?

Mike Marchywka marchywka at hotmail.com
Sun Dec 26 15:42:07 CET 2010

> From: marchywka at hotmail.com
> To: tal.galili at gmail.com; r-help at r-project.org
> Subject: RE: [R] Performing basic Multiple Sequence Alignment in R?
> Date: Tue, 21 Dec 2010 17:03:17 -0500

> > From: tal.galili at gmail.com
> > Date: Tue, 21 Dec 2010 20:17:18 +0200
> > Subject: Re: [R] Performing basic Multiple Sequence Alignment in R?
> > To: r-help at r-project.org
> >
> >
> > Dear Mike and Thomas,
> >
> > From what I gathered here (Thanks to Joris Meys):
> > http://stackoverflow.com/questions/4497747/how-to-perform-basic-multiple-sequence-alignments-in-r/4498434#4498434
> > There is an R interface to the MUSCLE algorithm in the bio3d package
> > (function seqaln()).
> > But not one for clustal.
> >
> > I will probably end up using pairwiseAlignment on pairs of alignments
> > with some sort of stopping rules (I'll have to play with it to see how
> > it works).
> http://scholar.google.com/scholar?hl=en&q=%22exact+string+matching%22+alignment
> http://citeseerx.ist.psu.edu/search?q=exact+string+matching+alignment+dna&submit=Search&sort=rel
> Certainly if you are flexible and can use whatever may be close in R that
> is fine but I seem to recall that exact string matching was a fast and
> interesting way to go and maybe some of the authors above, in the interest
> of promoting their work, would help implement an R version if there is demand.
> I seem to recall I did something like building indexes of the strings to be aligned
> first, then finding substrings that appeared exactly once in each of the sequences
> to be aligned (this was the most restrictive criterion, but you can imagine how to
> make it more accommodating). Now that you got me started, up-front tokenizing or
> compiling of input sequences (usually no more than indexing them in some way) made
> many later operations like alignment go faster. This may have ended up being
> similar to BLAST, but now I can't really recall. Anyway, my point here is that
> somewhere in R there may be packages that generate intermediate forms useful
> across disciplines: mining data from text, linguistics, or macromolecule analysis.
> In fact, the indexing process helps find things that have migrated a long way from
> their original place, and there are probably other non-alignment-related things
> you could get out of the approach.
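
To make the quoted indexing idea concrete, here is a minimal sketch of the most
restrictive criterion: index every sequence by its fixed-length substrings, then
keep only the substrings that occur exactly once in every sequence. It is written
in Python purely for illustration (not R, and not the original code); the function
names and the fixed substring length k are assumptions, not anything from the
thread.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each length-k substring of seq to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def unique_shared_anchors(seqs, k):
    """Return {kmer: [start in seqs[0], start in seqs[1], ...]} for every
    length-k substring that occurs exactly once in each sequence.

    Hypothetical helper: these "anchors" are candidate fixed points that a
    later alignment step could use to split the problem into smaller pieces.
    """
    if not seqs:
        return {}
    indexes = [kmer_positions(s, k) for s in seqs]
    anchors = {}
    for kmer in indexes[0]:
        # .get() avoids adding empty entries to the defaultdict indexes.
        if all(len(idx.get(kmer, ())) == 1 for idx in indexes):
            anchors[kmer] = [idx[kmer][0] for idx in indexes]
    return anchors


if __name__ == "__main__":
    sequences = ["ACGTTGCA", "TTACGTGG", "GGACGTAA"]
    print(unique_shared_anchors(sequences, 4))
```

Relaxing the criterion (for example, allowing a substring to be missing from some
sequences) only changes the `all(...)` test; the per-sequence indexes are built
once and reused.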

If you pursue this or reach a decision, would you please get back to
us, or at least to me off list? I just went back through my old code and
followed the search links I posted above; this still seems like quite an
interesting area, and the issues do not appear to be confined to biology.
Judging by the method names in my code, I had a way to supply fixed patterns,
probably from places like PROSITE or CDD, for use as the string you
probably meant to suggest, although I think it would make more sense
to discover these from the strings it finds in the sequences.

I seem to recall I could do 2 sequences reasonably well, with some quirks and
limitations, but gave up when I tried to do multiple alignments (actually there
was no point at the time). Recent literature still seems to talk about
sub-quadratic time, although in practice, for large sequences, the real execution
time could be dominated by virtual memory rather than algorithm order LOL. The
indexing also makes it possible to find related but distant strings, something
that may be of interest but is not normally thought of as alignment between
strings perturbed in limited ways ("edit distance" being rather restricted to a
few operations).
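
For readers unfamiliar with the term, the usual (Levenshtein) edit distance
allows only single-character insertion, deletion, and substitution. The sketch
below, in Python for illustration only, is the textbook dynamic-programming
version and is not taken from the code mentioned in this message; the function
name is an assumption.

```python
def levenshtein(a, b):
    """Classic quadratic-time edit distance between strings a and b,
    counting single-character insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(levenshtein("ABCDXY", "XYABCD"))
```

Because there is no "move" operation, a block that has merely changed position
is charged as separate deletions and insertions, which is one reason an index of
shared substrings can reveal relationships that this measure hides.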

If you find a specific paper or approach that seems to work, that may be
of interest to many here, and indeed it may already be implemented under some
other name.

