[BioC] Parse sequences so they align properly
Josh Banta [guest]
guest at bioconductor.org
Fri Jan 3 22:46:11 CET 2014
Dear listserv,
I have genetic sequence data in the following form:
ref.sequence <- "ATAGCCGCA"
sequence1 <- "AT[G][C][C]AGCCG[T]CA"
sequence2 <- "ATAGCCGC[C][A][C]A"
sequence3 <- "AT[GCC]AGCCGCA"
The brackets mark nucleotide insertions relative to the reference sequence ("ref.sequence"); a multi-base group such as "[GCC]" means the same thing as "[G][C][C]". Any given sequence may carry all, some, or none of the insertions.
What I want is for all of the loci to "align" properly: any sequence lacking a particular insertion should get a dash at that locus, one dash per inserted nucleotide.
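To make the bracket notation concrete, here is a small base R sketch (nothing Bioconductor-specific). It assumes that brackets only ever contain letters and are never nested. The names seqs, backbone and normalised are made up for illustration.

```r
# The example sequences from this post, collected into one named vector.
seqs <- c(ref.sequence = "ATAGCCGCA",
          sequence1    = "AT[G][C][C]AGCCG[T]CA",
          sequence2    = "ATAGCCGC[C][A][C]A",
          sequence3    = "AT[GCC]AGCCGCA")

# Dropping every bracketed group should leave the reference sequence behind.
backbone <- gsub("\\[[A-Za-z]*\\]", "", seqs)

# Rewrite multi-base groups base by base, so "[GCC]" becomes "[G][C][C]".
# The pattern matches the empty position between two letters that are followed
# by a run of letters and a closing bracket, i.e. positions inside a bracket.
normalised <- gsub("(?<=[A-Za-z])(?=[A-Za-z]+\\])", "][", seqs, perl = TRUE)
```

If every element of backbone equals ref.sequence, the sequences differ from the reference only by insertions, which is the case this post is about.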
I want to end up with this:
ref.sequence should look like this: "AT---AGCCG-C---A"
sequence1 should look like this: "AT[G][C][C]AGCCG[T]C---A"
sequence2 should look like this: "AT---AGCCG-C[C][A][C]A"
sequence3 should look like this: "AT[G][C][C]AGCCG-C---A"
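One possible way to get from the input to the output above, as a base R sketch rather than a tested solution. It assumes that every sequence shares the same backbone once the brackets are removed, that brackets hold letters only, and that a shorter insertion at a shared locus is simply padded on the right with dashes (the inserted bases themselves are not aligned to each other). The function name align_insertions and its internals are hypothetical.

```r
# Hypothetical helper: pad bracket-annotated sequences so that all loci line up.
align_insertions <- function(seqs) {
  parse_one <- function(x) {
    # Tokens are either a bracketed group ("[GCC]") or a single backbone base.
    tokens <- regmatches(x, gregexpr("\\[[A-Za-z]*\\]|[A-Za-z]", x))[[1]]
    is_ins <- substr(tokens, 1, 1) == "["
    bases  <- tokens[!is_ins]
    # Slot k holds the bases inserted after backbone base k - 1
    # (slot 1 is an insertion before the first base).
    slot <- cumsum(!is_ins) + 1L
    ins  <- character(length(bases) + 1L)
    for (i in which(is_ins)) {
      inner <- substr(tokens[i], 2L, nchar(tokens[i]) - 1L)
      ins[slot[i]] <- paste0(ins[slot[i]], inner)
    }
    list(bases = bases, ins = ins)
  }

  parsed <- lapply(seqs, parse_one)

  # Assumption: all sequences have the same backbone.
  backbone <- vapply(parsed, function(p) paste(p$bases, collapse = ""), character(1))
  stopifnot(length(unique(backbone)) == 1L)

  # Longest insertion seen at each slot across all sequences.
  width   <- Reduce(pmax, lapply(parsed, function(p) nchar(p$ins)))
  n_slots <- length(width)

  render <- function(p) {
    out <- character(n_slots)
    for (k in seq_len(n_slots)) {
      s <- p$ins[k]
      # Write each inserted base in its own pair of brackets.
      filled <- if (nzchar(s)) paste0("[", strsplit(s, "")[[1]], "]", collapse = "") else ""
      gaps   <- strrep("-", width[k] - nchar(s))
      base   <- if (k > 1L) p$bases[k - 1L] else ""
      out[k] <- paste0(base, filled, gaps)
    }
    paste0(out, collapse = "")
  }

  vapply(parsed, render, character(1))
}

aligned <- align_insertions(c(ref.sequence = "ATAGCCGCA",
                              sequence1    = "AT[G][C][C]AGCCG[T]CA",
                              sequence2    = "ATAGCCGC[C][A][C]A",
                              sequence3    = "AT[GCC]AGCCGCA"))
```

Here aligned is a named character vector with one padded string per input sequence. strrep() needs R 3.3.0 or later.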
So how can I make this happen efficiently? This may need only base R commands and nothing specific to Bioconductor; I do not know. In any case, you are the folks who would know how to do it. If you could help me with the syntax to get some working code, I would be very appreciative!
Thanks very much in advance,
-----------------------------------
Josh Banta, Ph.D
Assistant Professor
Department of Biology
The University of Texas at Tyler
Tyler, TX 75799
Tel: (903) 565-5655
http://plantevolutionaryecology.org
-- output of sessionInfo():
> ref.sequence <- "ATAGCCGCA"
> sequence1 <- "AT[G][C][C]AGCCG[T]CA"
> sequence2 <- "ATAGCCGC[C][A][C]A"
> sequence3 <- "AT[GCC]AGCCGCA"
>#now what?
--
Sent via the guest posting facility at bioconductor.org.