[BioC] Parse sequences so they align properly

Josh Banta [guest] guest at bioconductor.org
Fri Jan 3 22:46:11 CET 2014


Dear listserv,

I have genetic sequence data in the following form:

ref.sequence <- "ATAGCCGCA"
sequence1 <- "AT[G][C][C]AGCCG[T]CA"
sequence2 <- "ATAGCCGC[C][A][C]A"
sequence3 <- "AT[GCC]AGCCGCA"

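The brackets indicate nucleotide insertions relative to the reference sequence ("ref.sequence"). A given sequence may carry all, some, or none of the insertions. An insertion may be written one base per bracket (as in "[G][C][C]") or as a single bracketed group (as in "[GCC]").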
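As a quick sanity check on this notation, here is a minimal sketch in base R. It assumes that insertions contain only letters and that everything outside the brackets matches the reference exactly; the names "seqs" and "backbone" are made up for illustration.

```r
ref.sequence <- "ATAGCCGCA"
seqs <- c(sequence1 = "AT[G][C][C]AGCCG[T]CA",
          sequence2 = "ATAGCCGC[C][A][C]A",
          sequence3 = "AT[GCC]AGCCGCA")

# Delete every bracketed group; what remains should be the reference.
backbone <- gsub("\\[[A-Za-z]+\\]", "", seqs)

# TRUE when every sequence reduces to the reference once insertions are removed.
all(backbone == ref.sequence)
```

If this check fails for a sequence, that sequence differs from the reference by more than insertions, and a position-based padding scheme would not apply to it.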

What I want is for all of the loci to "align" properly. Therefore, the sequences lacking a particular insertion should get scored with a dash (or dashes) at that locus.

I want to end up with this:

ref.sequence should look like this: "AT---AGCCG-C---A"
sequence1 should look like this: "AT[G][C][C]AGCCG[T]C---A"
sequence2 should look like this: "AT---AGCCG-C[C][A][C]A"
sequence3 should look like this: "AT[G][C][C]AGCCG-C---A"

How can I do this efficiently? It may need only "regular" R commands and nothing specific to Bioconductor; I am not sure. In any case, you are the folks who would know how to do it. If you could help me with the syntax to get some working code, I would be very appreciative!
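One possible approach, as a rough sketch in base R only: record which bases each sequence inserts after each reference position, take the longest insertion at each position across all sequences, and pad the shorter ones with dashes. The function names "parse_insertions" and "align_insertions" are hypothetical, not from any package. The sketch assumes the unbracketed bases of every sequence line up with the reference, and that an insertion shorter than the longest one at the same position is padded on the right. It is not a true multiple alignment.

```r
# Split one bracket-annotated sequence into reference bases and insertions.
# ins[k + 1] holds the bases inserted after reference position k (k = 0..n).
parse_insertions <- function(s) {
  tokens <- regmatches(s, gregexpr("\\[[A-Za-z]+\\]|[A-Za-z]", s))[[1]]
  is_ins <- substr(tokens, 1, 1) == "["
  bases  <- tokens[!is_ins]
  ins    <- character(length(bases) + 1)
  pos    <- cumsum(!is_ins)  # number of reference bases seen up to each token
  for (i in which(is_ins)) {
    slot <- pos[i] + 1
    ins[slot] <- paste0(ins[slot], gsub("\\[|\\]", "", tokens[i]))
  }
  list(bases = bases, ins = ins)
}

# Pad every sequence so that all insertion slots have the same width.
align_insertions <- function(seqs) {
  parsed  <- lapply(seqs, parse_insertions)
  n_bases <- vapply(parsed, function(p) length(p$bases), integer(1))
  stopifnot(length(unique(n_bases)) == 1)  # assumption: same backbone length

  # Widest insertion seen at each slot, across all sequences.
  width <- do.call(pmax, unname(lapply(parsed, function(p) nchar(p$ins))))

  render <- function(p) {
    shown <- gsub("([A-Za-z])", "[\\1]", p$ins)    # one bracket per base
    pad   <- strrep("-", width - nchar(p$ins))     # dashes for missing bases
    paste0(shown, pad, c(p$bases, ""), collapse = "")
  }
  vapply(parsed, render, character(1))
}

seqs <- c(ref.sequence = "ATAGCCGCA",
          sequence1    = "AT[G][C][C]AGCCG[T]CA",
          sequence2    = "ATAGCCGC[C][A][C]A",
          sequence3    = "AT[GCC]AGCCGCA")
aligned <- align_insertions(seqs)
print(aligned)
```

On the four example sequences this should reproduce the desired strings listed above, including the expansion of "[GCC]" into "[G][C][C]". Note that strrep() needs R 3.3.0 or later.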

Thanks very much in advance,
-----------------------------------
Josh Banta, Ph.D
Assistant Professor
Department of Biology
The University of Texas at Tyler
Tyler, TX 75799
Tel: (903) 565-5655
http://plantevolutionaryecology.org

 -- output of sessionInfo(): 

> ref.sequence <- "ATAGCCGCA"
> sequence1 <- "AT[G][C][C]AGCCG[T]CA"
> sequence2 <- "ATAGCCGC[C][A][C]A"
> sequence3 <- "AT[GCC]AGCCGCA"
> # now what?

--
Sent via the guest posting facility at bioconductor.org.


