[BioC] Parse sequences so they align properly

Sun Jan 5 07:18:17 CET 2014

Hi Joshua,

On 01/04/2014 02:05 PM, Joshua Banta wrote:
> Dear Hervé,
>
> Thank you very much for your response.
>
> Your script has a line of code that reads:
>
>   if (gsub("\\[.\\]", "", seq) != ref.sequence)
> stop("'sequences[", i, "]' is not compatible with 'ref.sequence'")
>
> I take it this means that your functions do not allow for polymorphisms
> and other variations. For instance:
>
> ref.sequence <- "GCAAATT"
> sequences <- read.table(textConnection("name sequence
> sequence1 GC-AATT
> sequence2 GCAAAT[T]T
> sequence3 GCAAAGT"), header = TRUE, as.is <http://as.is> = TRUE)
> closeAllConnections()
>
> Sequence1 has a deletion (relative to the reference sequence) at
> position 3, sequence 2 has an insertion (relative to the reference
> sequence) between positions 5 and 6, and sequence3 has a G instead of a
> T (relative to the reference sequence) at position 6. (I am numbering
> the positions from left to right, using the reference sequence to assign
> the position numbers.)
>
> What I want to get is this:
>
> reference.sequence: "GCAAAT-T
> sequence1: GC-AAT-T
> sequence2: GCAAATTT
> sequence3: GCAAAG-T

Based on the description you gave of your data, it seemed that your
sequences only had insertions with respect to the reference sequence.
So the code I provided only supported insertions. Below I modified
the original code to support insertions, deletions, and replacements.
I also fixed a bug in the .insertGaps() helper function when there is
nothing to insert.

   library(IRanges)  # for the Rle(), runLength(), isSingleString(),
                     # and IRanges:::fancy_mseq() functions

   .isCompatibleWithRef <- function(sequence, ref.sequence)
   {
     seq0 <- gsub("\\[.\\]", "", sequence)
     nchar(seq0) == nchar(ref.sequence)
   }

   .computeInsSizes <- function(sequence)
   {
     tmp <- strsplit(gsub("\\[.\\]", "i", sequence), "", fixed=TRUE)[[1L]]
     head(runLength(Rle(cumsum(tmp != "i"))) - 1L, n=-1L)
   }

   computeInsSizesInRef <- function(ref.sequence, sequences)
   {
     if (!isSingleString(ref.sequence))
         stop("'ref.sequence' must be a single string")
     if (!is.character(sequences))
         stop("'sequences' must be a character vector")
     ans <- integer(nchar(ref.sequence) - 1L)
     for (i in seq_along(sequences)) {
         seq <- sequences[i]
         if (!.isCompatibleWithRef(seq, ref.sequence))
             stop("'sequences[", i, "]' is not compatible with 
'ref.sequence'")
         ins_sizes <- .computeInsSizes(seq)
         ans <- pmax(ans, ins_sizes)
     }
     ans
   }

   .insertGaps <- function(x, after, gap_sizes)
   {
     sum_gap_sizes <- sum(gap_sizes)
     if (sum_gap_sizes == 0L)
         return(x)
     ans_len <- nchar(x) + sum_gap_sizes
     ans <- character(ans_len)
     ans[] <- "-"
     offset <- after + c(0L, cumsum(head(gap_sizes, n=-1L)))
     idx <- IRanges:::fancy_mseq(gap_sizes, offset=offset)
     ans[-idx] <- strsplit(x, "", fixed=TRUE)[[1L]]
     paste(ans, collapse="")
   }

   stretchRefSequence <- function(ref.sequence, sequences)
   {
     ins_sizes_in_ref <- computeInsSizesInRef(ref.sequence, sequences)
     ins_after <- seq_len(nchar(ref.sequence) - 1L)
     .insertGaps(ref.sequence, ins_after, ins_sizes_in_ref)
   }

   stretchSequences <- function(ref.sequence, sequences)
   {
     ins_sizes_in_ref <- computeInsSizesInRef(ref.sequence, sequences)
     sapply(sequences,
            function(seq) {
                ins_sizes <- .computeInsSizes(seq)
                naked_seq <- gsub("[\\[\\]]", "", seq, perl=TRUE)
                gap_sizes <- ins_sizes_in_ref - ins_sizes
                ins_after <- seq_along(ins_sizes_in_ref) + cumsum(ins_sizes)
                .insertGaps(naked_seq, ins_after, gap_sizes)
            },
            USE.NAMES=FALSE)
   }

Cheers,
H.

>
> When I apply your functions to this new data, I get the following errors:
>
>  > stretchRefSequence(ref.sequence, as.character(sequences[,2]))
> Error in computeInsSizesInRef(ref.sequence, sequences) :
>    'sequences[1]' is not compatible with 'ref.sequence'
>
>  > stretchSequences(ref.sequence, as.character(sequences[,2]))
> Error in computeInsSizesInRef(ref.sequence, sequences) :
>    'sequences[1]' is not compatible with 'ref.sequence'
>
> I tried deleting the following lines of code from your script:
>
>   if (gsub("\\[.\\]", "", seq) != ref.sequence)
> stop("'sequences[", i, "]' is not compatible with 'ref.sequence'")
>
> Reapplying the functions, I get:
>
>  > stretchRefSequence(ref.sequence, as.character(sequences[,2]))
> [1] "GC-AAA-TT"
>  > #not correct
>
>  > stretchSequences(ref.sequence, as.character(sequences[,2]))
> [1] "GC-AAT-T"  "GC-AAATTT" "GC-AAA-GT"
> Warning messages:
> 1: In .Method(..., na.rm = na.rm) :
>    an argument will be fractionally recycled
> 2: In ins_sizes_in_ref - ins_sizes :
>    longer object length is not a multiple of shorter object length
> 3: In seq_along(ins_sizes_in_ref) + cumsum(ins_sizes) :
>    longer object length is not a multiple of shorter object length
>  > #also not correct
>
> Is there a way to modify your codes to make my new toy data give me the
> answers I'm looking for? Again, I really appreciate your help, because
> this parsing syntax is way over my head at the moment.
>
> -----------------------------------
> Josh Banta, Ph.D
> Assistant Professor
> Department of Biology
> The University of Texas at Tyler
> Tyler, TX 75799
> Tel: (903) 565-5655
> http://plantevolutionaryecology.org
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319