[BioC] Parse sequences so they align properly

Sun Jan 5 08:38:17 CET 2014

Hi Martin,

On 01/04/2014 03:14 PM, Martin Morgan wrote:
> On 01/04/2014 02:05 PM, Joshua Banta wrote:
>> Dear Hervé,
>>
>> Thank you very much for your response.
>>
>> Your script has a line of code that reads:
>>
>>   if (gsub("\\[.\\]", "", seq) != ref.sequence)
>>    stop("'sequences[", i, "]' is not compatible with 'ref.sequence'")
>>
>> I take it this means that your functions do not allow for
>> polymorphisms and other variations. For instance:
>>
>> ref.sequence <- "GCAAATT"
>> sequences <- read.table(textConnection("name sequence
>> sequence1 GC-AATT
>> sequence2 GCAAAT[T]T
>> sequence3 GCAAAGT"), header = TRUE, as.is<http://as.is> = TRUE)
>> closeAllConnections()
>>
>> Sequence1 has a deletion (relative to the reference sequence) at
>> position 3, sequence 2 has an insertion (relative to the reference
>> sequence) between positions 5 and 6, and sequence3 has a G instead of
>> a T (relative to the reference sequence) at position 6. (I am
>> numbering the positions from left to right, using the reference
>> sequence to assign the position numbers.)
>>
>> What I want to get is this:
>>
>> reference.sequence: "GCAAAT-T
>> sequence1: GC-AAT-T
>> sequence2: GCAAATTT
>> sequence3: GCAAAG-T
>
> I took a different approach, not super efficient. Here's the original
> data, modified so that [GCC] is replaced by [G][C][C], with a single
> vector of 'sequences', and the ref.sequence included as the first (I
> don't think it's handled specially, other than not having any insertions).
>
>      ref.sequence <- "ATAGCCGCA"
>      sequences <- c(ref.sequence,
>                     "AT[G][C][C]AGCCG[T]CA",
>                     "ATAGCCGC[C][A][C]A",
>                     "AT[G][C][C]AGCCGCA")
>
> I create two copies of the sequences, one with the gap characters
> replaced by '+' and another with the raw sequence (no [] marks)
>
>      gapped <- gsub("\\[.\\]", "+", sequences)
>      raw <- gsub("(\\[|\\])", "", sequences)
>
> I'm going to look at each letter in each 'gapped' sequence and decide
> whether it's a gap or not, and build up 'result' strings that are made
> of either an induced gap or the actual character at that position.
> Here's the index into the character positions of the gapped string, the
> result strings that I'm going to build up, and a stopping criterion
>
>      idx <- rep(1, length(gapped))
>      result <- character(length(gapped))
>      maxit <- nchar(gapped)
>
> Now I'll look at each character of each string. When there are
> insertions, I'll either add '-' or the inserted ('raw') character to the
> result string, and increment the index into the string(s) with the
> insertion.
>
>      repeat {
>          if (all(idx > maxit))
>              break
>          ins <- substr(gapped, idx, idx) == "+"
>          char <- substr(raw, idx, idx)
>          if (any(ins)) {
>              char <- ifelse(ins, char, "-")
>              idx <- ifelse(ins, idx + 1, idx)
>          } else
>              idx <- idx + 1
>          result <- sprintf("%s%s", result, char)
>      }
>
> and then print the result out
>
>      cat(paste(result, collapse="\n"), "\n")
>      ## AT---AGCCG-C---A
>      ## ATGCCAGCCGTC---A
>      ## AT---AGCCG-CCACA
>      ## ATGCCAGCCG-C---A
>
> or for your second example
>
>      cat(paste(result, collapse="\n"), "\n")
>      ## GCAAAT-T
>      ## GC-AAT-T
>      ## GCAAATTT
>      ## GCAAAG-T

Nice. Much simpler than my solution.

It's interesting to note that the main difference between your solution
and mine is that your main loop (your repeat statement) is on the
"columns" of the dataset while my main loop (my for and sapply
statements, because I actually make 2 passes) is on the "rows" of
the dataset (i.e. on the individual sequences).

As a consequence, your solution is faster than mine when the
dataset is made of a big number (typically thousands) of short
sequences (typically < 1000 chars) while mine is faster than yours
when the dataset is made of a small number (e.g. only a few hundreds)
of long sequences (typically > 10000 chars).

So an optimized pure R solution would require to integrate the 2
implementations and to add an internal switch to one or the other
depending on the shape of the dataset. Probably ugly enough to
justify that one of the 2 implementations is moved to the C level
and used in all situations (it'll be fast whatever the shape of the
dataset is).

H.

>
>>
>> When I apply your functions to this new data, I get the following errors:
>>
>>> stretchRefSequence(ref.sequence, as.character(sequences[,2]))
>> Error in computeInsSizesInRef(ref.sequence, sequences) :
>>    'sequences[1]' is not compatible with 'ref.sequence'
>>
>>> stretchSequences(ref.sequence, as.character(sequences[,2]))
>> Error in computeInsSizesInRef(ref.sequence, sequences) :
>>    'sequences[1]' is not compatible with 'ref.sequence'
>>
>> I tried deleting the following lines of code from your script:
>>
>>   if (gsub("\\[.\\]", "", seq) != ref.sequence)
>>    stop("'sequences[", i, "]' is not compatible with 'ref.sequence'")
>>
>> Reapplying the functions, I get:
>>
>>> stretchRefSequence(ref.sequence, as.character(sequences[,2]))
>> [1] "GC-AAA-TT"
>>> #not correct
>>
>>> stretchSequences(ref.sequence, as.character(sequences[,2]))
>> [1] "GC-AAT-T"  "GC-AAATTT" "GC-AAA-GT"
>> Warning messages:
>> 1: In .Method(..., na.rm = na.rm) :
>>    an argument will be fractionally recycled
>> 2: In ins_sizes_in_ref - ins_sizes :
>>    longer object length is not a multiple of shorter object length
>> 3: In seq_along(ins_sizes_in_ref) + cumsum(ins_sizes) :
>>    longer object length is not a multiple of shorter object length
>>> #also not correct
>>
>> Is there a way to modify your codes to make my new toy data give me
>> the answers I'm looking for? Again, I really appreciate your help,
>> because this parsing syntax is way over my head at the moment.
>>
>> -----------------------------------
>> Josh Banta, Ph.D
>> Assistant Professor
>> Department of Biology
>> The University of Texas at Tyler
>> Tyler, TX 75799
>> Tel: (903) 565-5655
>> http://plantevolutionaryecology.org
>>
>>
>>     [[alternative HTML version deleted]]
>>
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319