[BioC] Performing basic Multiple Sequence Alignment in R?
Thomas Girke
thomas.girke at ucr.edu
Tue Dec 21 17:39:07 CET 2010
To computer the multiple alignments, you need to use external programs
like dialign, ClustalW, T-Coffee, MUSCLE or Multalin. If you are familiar with
the command-line then you can execute them from R with the system() function,
and then import the resulting alignment files into R as XMultipleAlignment objects.
For details see: http://bioconductor.org/packages/release/bioc/html/Biostrings.html.
For handling multiple alignements in R, you also may want take a look at some of these examples
http://manuals.bioinformatics.ucr.edu/home/ht-seq#TOC-Multiple-Sequence-Alignments-MSAs-
or the ape library at CRAN.
Thomas
On Tue, Dec 21, 2010 at 11:29:05AM +0200, Tal Galili wrote:
> Hello everyone,
>
> I am not sure if this should go on the general R mailing list (for example,
> if there is a text mining solution that might work here) or the bioconductor
> mailing list (since I wasn't able to find a solution to my question on
> searching their lists) - so this time I tried both, and in the future I'll
> know better (in case it should go to only one of the two).
>
>
> The task I'm trying to achieve is to align several sequences together.
> I don't have a basic pattern to match to. All that I know is that the
> "True" pattern should be of length "30" and that the sequences I'm looking
> at, have had missing values introduced to them at random points.
> Here is an example of such sequences, were on the left we see what is the
> real location of the missing values, and on the right we see the sequence
> that we will be able to observe. My goal is to reconstruct the left column
> using only the sequences I've got on the right column (based on the fact
> that many of the letters in each position are the same)
>
> Real_sequence The_sequence_we_see
>
> 1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
>
> 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG
>
> 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG
>
> 4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG
>
> 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
>
> 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG
>
> 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG
>
>
> Here is an example code to reproduce the above example:
>
> ATCG <- c("A","T","C","G")
>
> set.seed(40)
>
>
> original.seq <- sample(ATCG, 30, T)
>
>
> seqS <- matrix(original.seq,200,30, T)
>
>
> change.letters <- function(x, number.of.changes = 15,
> letters.to.change.with = ATCG)
>
> {
>
>
> number.of.changes <- sample(seq_len(number.of.changes), 1)
>
>
> new.letters <- sample(letters.to.change.with , number.of.changes, T)
>
>
> where.to.change.the.letters <- sample(seq_along(x) , number.of.changes, F)
>
>
> x[where.to.change.the.letters] <- new.letters
>
>
> return(x)
>
> }
>
>
> change.letters(original.seq)
>
>
> insert.missing.values <- function(x) change.letters(x, 3, "-")
>
>
> insert.missing.values(original.seq)
>
>
> seqS2 <- t(apply(seqS, 1, change.letters))
>
>
> seqS3 <- t(apply(seqS2, 1, insert.missing.values))
>
>
> seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")})
>
> require(stringr)
>
> # library(help=stringr)
>
>
> all.seqS <- str_replace(seqS4,"-" , "")
>
>
> # how do we allign this?
>
>
> data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)
>
>
> I understand that if all I had was a string and a pattern I would be able to
> use
>
> library(Biostrings)
>
>
> pairwiseAlignment(...)
>
>
>
> But in the case I present we are dealing with many sequences to align to one
> another (instead of aligning them to one pattern).
>
> Is there a known method for doing this in R?
>
>
> Thanks,
>
> Tal
>
>
>
> ----------------Contact
> Details:-------------------------------------------------------
> Contact me: Tal.Galili at gmail.com | 972-52-7275845
> Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
> www.r-statistics.com (English)
> ----------------------------------------------------------------------------------------------
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list