[BioC] Parse sequences so they align properly

Sat Jan 4 08:37:19 CET 2014

Hi Josh,

On 01/03/2014 01:46 PM, Josh Banta [guest] wrote:
>
> Dear listserv,
>
> I have genetic sequence data in the following form:
>
> ref.sequence <- "ATAGCCGCA"
> sequence1 <- "AT[G][C][C]AGCCG[T]CA"
> sequence2 <- "ATAGCCGC[C][A][C]A"
> sequence3 <- "AT[GCC]AGCCGCA"
>
> The brackets indicate nucleotide insertions relative to the reference sequence ("ref.sequence"). Some sequences may have some/all of the insertions, some may not.

The fact that you represent the same 3-letter insertion in 2 different
ways (e.g. [G][C][C] in sequence1 and [GCC] in sequence3) makes things
a little bit more complicated than they need to be. So I'm going to
assume that brackets are around all *individual* inserted letters,
that is:

   ref.sequence <- "ATAGCCGCA"
   sequences <-  c("AT[G][C][C]AGCCG[T]CA",
                   "ATAGCCGC[C][A][C]A",
                   "AT[G][C][C]AGCCGCA")
>
> What I want is for all of the loci to "align" properly. Therefore, the sequences lacking a particular insertion should get scored with a dash (or dashes) at that locus.
>
> I want to end up with this:
>
> ref.sequence should look like this: "AT---AGCCG-C---A"
> sequence1 should look like this: "AT[G][C][C]AGCCG[T]C---A"
> sequence2 should look like this: "AT---AGCCG-C[C][A][C]A"
> sequence3 should look like this: "AT[G][C][C]AGCCG-C---A"
>
> So how can I make this happen efficiently? This may require only "regular" R commands, and nothing specific to Bioconductor. I do not know. In any case, you are the folks that would know how to do it. If you could help me out with the syntax to get some working code I would be very appreciative!

Why do you want to keep the brackets in the "stretched" versions of the
sequences? They're not needed. It makes the alignments really hard to
see and a lot of downstream tools are probably going to choke on those
brackets. It would be better to just tend up with:

   stretched.ref.sequence <- "AT---AGCCG-C---A"
   stretched.sequences <-  c("ATGCCAGCCGTC---A",
                             "AT---AGCCG-CCACA",
                             "ATGCCAGCCG-C---A")

Note that now all the stretched sequences have the same length.

So we need 2 functions, stretchRefSequence() and stretchSequences(),
that will perform this stretching:

   > stretchRefSequence(ref.sequence, sequences)
   [1] "AT---AGCCG-C---A"

   > stretchSequences(ref.sequence, sequences)
   [1] "ATGCCAGCCGTC---A" "AT---AGCCG-CCACA" "ATGCCAGCCG-C---A"

Here is how they can be implemented (only lightly tested):

   library(IRanges)  # for the isSingleString() and
                     # IRanges:::fancy_mseq() functions

   .computeInsSizes <- function(sequence)
   {
     tmp <- strsplit(gsub("\\[.\\]", "-", sequence), "", fixed=TRUE)[[1L]]
     head(runLength(Rle(cumsum(tmp != "-"))) - 1L, n=-1L)
   }

   computeInsSizesInRef <- function(ref.sequence, sequences)
   {
     if (!isSingleString(ref.sequence))
         stop("'ref.sequence' must be a single string")
     if (!is.character(sequences))
         stop("'sequences' must be a character vector")
     ans <- integer(nchar(ref.sequence) - 1L)
     for (i in seq_along(sequences)) {
         seq <- sequences[i]
         if (gsub("\\[.\\]", "", seq) != ref.sequence)
             stop("'sequences[", i, "]' is not compatible with 
'ref.sequence'")
         ins_sizes <- .computeInsSizes(seq)
         ans <- pmax(ans, ins_sizes)
     }
     ans
   }

   .insertGapsAt <- function(x, at, gap_sizes)
   {
     ans_len <- nchar(x) + sum(gap_sizes)
     ans <- character(ans_len)
     ans[] <- "-"
     offset <- at + c(0L, cumsum(head(gap_sizes, n=-1L)))
     idx <- IRanges:::fancy_mseq(gap_sizes, offset=offset)
     ans[-idx] <- strsplit(x, "", fixed=TRUE)[[1L]]
     paste(ans, collapse="")
   }

   stretchRefSequence <- function(ref.sequence, sequences)
   {
     ins_sizes_in_ref <- computeInsSizesInRef(ref.sequence, sequences)
     ins_at <- seq_len(nchar(ref.sequence) - 1L)
     .insertGapsAt(ref.sequence, ins_at, ins_sizes_in_ref)
   }

   stretchSequences <- function(ref.sequence, sequences)
   {
     ins_sizes_in_ref <- computeInsSizesInRef(ref.sequence, sequences)
     sapply(sequences,
            function(seq) {
                ins_sizes <- .computeInsSizes(seq)
                naked_seq <- gsub("[\\[\\]]", "", seq, perl=TRUE)
                gap_sizes <- ins_sizes_in_ref - ins_sizes
                ins_at <- seq_along(ins_sizes_in_ref) + cumsum(ins_sizes)
                .insertGapsAt(naked_seq, ins_at, gap_sizes)
            },
            USE.NAMES=FALSE)
   }

Cheers,
H.

>
> Thanks very much in advance,
> -----------------------------------
> Josh Banta, Ph.D
> Assistant Professor
> Department of Biology
> The University of Texas at Tyler
> Tyler, TX 75799
> Tel: (903) 565-5655
> http://plantevolutionaryecology.org
>
>   -- output of sessionInfo():
>
>> ref.sequence <- "ATAGCCGCA"
>> sequence1 <- "AT[G][C][C]AGCCG[T]CA"
>> sequence2 <- "ATAGCCGC[C][A][C]A"
>> sequence3 <- "AT[GCC]AGCCGCA"
>> #now what?
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319