[BioC] Parse sequences so they align properly
Hervé Pagès
hpages at fhcrc.org
Sun Jan 5 07:18:17 CET 2014
Hi Joshua,
On 01/04/2014 02:05 PM, Joshua Banta wrote:
> Dear Hervé,
>
> Thank you very much for your response.
>
> Your script has a line of code that reads:
>
> if (gsub("\\[.\\]", "", seq) != ref.sequence)
> stop("'sequences[", i, "]' is not compatible with 'ref.sequence'")
>
> I take it this means that your functions do not allow for polymorphisms
> and other variations. For instance:
>
> ref.sequence <- "GCAAATT"
> sequences <- read.table(textConnection("name sequence
> sequence1 GC-AATT
> sequence2 GCAAAT[T]T
> sequence3 GCAAAGT"), header = TRUE, as.is <http://as.is> = TRUE)
> closeAllConnections()
>
> Sequence1 has a deletion (relative to the reference sequence) at
> position 3, sequence 2 has an insertion (relative to the reference
> sequence) between positions 5 and 6, and sequence3 has a G instead of a
> T (relative to the reference sequence) at position 6. (I am numbering
> the positions from left to right, using the reference sequence to assign
> the position numbers.)
>
> What I want to get is this:
>
> reference.sequence: "GCAAAT-T
> sequence1: GC-AAT-T
> sequence2: GCAAATTT
> sequence3: GCAAAG-T
Based on the description you gave of your data, it seemed that your
sequences only had insertions with respect to the reference sequence.
So the code I provided only supported insertions. Below I modified
the original code to support insertions, deletions, and replacements.
I also fixed a bug in the .insertGaps() helper function when there is
nothing to insert.
library(IRanges) # for the Rle(), runLength(), isSingleString(),
# and IRanges:::fancy_mseq() functions
.isCompatibleWithRef <- function(sequence, ref.sequence)
{
seq0 <- gsub("\\[.\\]", "", sequence)
nchar(seq0) == nchar(ref.sequence)
}
.computeInsSizes <- function(sequence)
{
tmp <- strsplit(gsub("\\[.\\]", "i", sequence), "", fixed=TRUE)[[1L]]
head(runLength(Rle(cumsum(tmp != "i"))) - 1L, n=-1L)
}
computeInsSizesInRef <- function(ref.sequence, sequences)
{
if (!isSingleString(ref.sequence))
stop("'ref.sequence' must be a single string")
if (!is.character(sequences))
stop("'sequences' must be a character vector")
ans <- integer(nchar(ref.sequence) - 1L)
for (i in seq_along(sequences)) {
seq <- sequences[i]
if (!.isCompatibleWithRef(seq, ref.sequence))
stop("'sequences[", i, "]' is not compatible with
'ref.sequence'")
ins_sizes <- .computeInsSizes(seq)
ans <- pmax(ans, ins_sizes)
}
ans
}
.insertGaps <- function(x, after, gap_sizes)
{
sum_gap_sizes <- sum(gap_sizes)
if (sum_gap_sizes == 0L)
return(x)
ans_len <- nchar(x) + sum_gap_sizes
ans <- character(ans_len)
ans[] <- "-"
offset <- after + c(0L, cumsum(head(gap_sizes, n=-1L)))
idx <- IRanges:::fancy_mseq(gap_sizes, offset=offset)
ans[-idx] <- strsplit(x, "", fixed=TRUE)[[1L]]
paste(ans, collapse="")
}
stretchRefSequence <- function(ref.sequence, sequences)
{
ins_sizes_in_ref <- computeInsSizesInRef(ref.sequence, sequences)
ins_after <- seq_len(nchar(ref.sequence) - 1L)
.insertGaps(ref.sequence, ins_after, ins_sizes_in_ref)
}
stretchSequences <- function(ref.sequence, sequences)
{
ins_sizes_in_ref <- computeInsSizesInRef(ref.sequence, sequences)
sapply(sequences,
function(seq) {
ins_sizes <- .computeInsSizes(seq)
naked_seq <- gsub("[\\[\\]]", "", seq, perl=TRUE)
gap_sizes <- ins_sizes_in_ref - ins_sizes
ins_after <- seq_along(ins_sizes_in_ref) + cumsum(ins_sizes)
.insertGaps(naked_seq, ins_after, gap_sizes)
},
USE.NAMES=FALSE)
}
Cheers,
H.
>
> When I apply your functions to this new data, I get the following errors:
>
> > stretchRefSequence(ref.sequence, as.character(sequences[,2]))
> Error in computeInsSizesInRef(ref.sequence, sequences) :
> 'sequences[1]' is not compatible with 'ref.sequence'
>
> > stretchSequences(ref.sequence, as.character(sequences[,2]))
> Error in computeInsSizesInRef(ref.sequence, sequences) :
> 'sequences[1]' is not compatible with 'ref.sequence'
>
> I tried deleting the following lines of code from your script:
>
> if (gsub("\\[.\\]", "", seq) != ref.sequence)
> stop("'sequences[", i, "]' is not compatible with 'ref.sequence'")
>
> Reapplying the functions, I get:
>
> > stretchRefSequence(ref.sequence, as.character(sequences[,2]))
> [1] "GC-AAA-TT"
> > #not correct
>
> > stretchSequences(ref.sequence, as.character(sequences[,2]))
> [1] "GC-AAT-T" "GC-AAATTT" "GC-AAA-GT"
> Warning messages:
> 1: In .Method(..., na.rm = na.rm) :
> an argument will be fractionally recycled
> 2: In ins_sizes_in_ref - ins_sizes :
> longer object length is not a multiple of shorter object length
> 3: In seq_along(ins_sizes_in_ref) + cumsum(ins_sizes) :
> longer object length is not a multiple of shorter object length
> > #also not correct
>
> Is there a way to modify your codes to make my new toy data give me the
> answers I'm looking for? Again, I really appreciate your help, because
> this parsing syntax is way over my head at the moment.
>
> -----------------------------------
> Josh Banta, Ph.D
> Assistant Professor
> Department of Biology
> The University of Texas at Tyler
> Tyler, TX 75799
> Tel: (903) 565-5655
> http://plantevolutionaryecology.org
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list