[R] Performing basic Multiple Sequence Alignment in R?
David Winsemius
dwinsemius at comcast.net
Tue Dec 21 14:06:19 CET 2010
Tal; I'm trimming the BioC posting. In the R lists it is considered
spamming to cross post. (Please re-read the Posting Guide.)
On Dec 21, 2010, at 4:21 AM, Tal Galili wrote:
> Hello everyone,
>
> I am not sure if this should go on the general R mailing list (for
> example,
> if there is a text mining solution that might work here) or the
> bioconductor
> mailing list (since I wasn't able to find a solution to my question on
> searching their lists) - so this time I tried both, and in the
> future I'll
> know better (in case it should go to only one of the two).
>
>
> The task I'm trying to achieve is to align several sequences together.
> I don't have a basic pattern to match to. All that I know is that the
> "True" pattern should be of length "30" and that the sequences I'm
> looking
> at, have had missing values introduced to them at random points.
> Here is an example of such sequences, were on the left we see what
> is the
> real location of the missing values, and on the right we see the
> sequence
> that we will be able to observe. My goal is to reconstruct the left
> column
> using only the sequences I've got on the right column (based on the
> fact
> that many of the letters in each position are the same)
>
> Real_sequence The_sequence_we_see
> 1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
> 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG
> 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG
> 4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG
> 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
> 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG
> 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG
>
The agrep function allows one to specify which sort of differences to
consider in calculating a Levenshtein edit distance. Insertions are
one possible distance component. You could take a look at its code (in
C in hte sources) and perhaps rejigger it to spit out the location of
the deletions.
> agrep(seqdat$The_sequence_we_see[1], seqdat$Real_sequence,
max.distance=list(deletions=0, substitutions=0, insertions=0))
integer(0)
> agrep(seqdat$The_sequence_we_see[1], seqdat$Real_sequence,
max.distance=list(deletions=0, substitutions=0, insertions=1))
[1] 1
--
David.
>
> Here is an example code to reproduce the above example:
>
> ATCG <- c("A","T","C","G")
> set.seed(40)
>
> original.seq <- sample(ATCG, 30, T)
>
> seqS <- matrix(original.seq,200,30, T)
>
> change.letters <- function(x, number.of.changes = 15,
> letters.to.change.with = ATCG)
> {
>
> number.of.changes <- sample(seq_len(number.of.changes), 1)
>
> new.letters <- sample(letters.to.change.with , number.of.changes,
> T)
>
> where.to.change.the.letters <- sample(seq_along(x) ,
> number.of.changes, F)
>
> x[where.to.change.the.letters] <- new.letters
>
> return(x)
> }
>
> change.letters(original.seq)
>
> insert.missing.values <- function(x) change.letters(x, 3, "-")
>
> insert.missing.values(original.seq)
>
> seqS2 <- t(apply(seqS, 1, change.letters))
>
> seqS3 <- t(apply(seqS2, 1, insert.missing.values))
>
> seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")})
> require(stringr)
> # library(help=stringr)
>
> all.seqS <- str_replace(seqS4,"-" , "")
>
> # how do we allign this?
>
> data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)
>
>
> I understand that if all I had was a string and a pattern I would be
> able to
> use
>
> library(Biostrings)
>
> pairwiseAlignment(...)
>
>
>
> But in the case I present we are dealing with many sequences to
> align to one
> another (instead of aligning them to one pattern).
>
> Is there a known method for doing this in R?
>
>
> Thanks,
>
> Tal
>
>
>
> ----------------Contact
> Details:-------------------------------------------------------
> Contact me: Tal.Galili at gmail.com | 972-52-7275845
> Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il
> (Hebrew) |
> www.r-statistics.com (English)
> ----------------------------------------------------------------------------------------------
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
West Hartford, CT
More information about the R-help
mailing list