[BioC] matchTable for pairwiseAlignment in Biostrings?

Thu Aug 22 23:01:03 CEST 2013

Hi Valerie,

Thank you for looking into this! I actually didn't see my post show up so I
did a little more tinkering and came up with the following function, see
what you think:

(input is PairwiseAlignmentsSingleSubject)

matchTable <- function(pwa) {
p.start <- start(pattern(pwa))
p.end <- end(pattern(pwa))
s.start <- start(subject(pwa))
s.end <- end(subject(pwa))
inserts <- as.matrix(insertion(pwa)[[1]])
dels <- as.matrix(deletion(pwa)[[1]])
s.pos1 <- c(s.start,(s.start+inserts[,1]-1))
p.pos1 <- c(p.start,(p.start + cumsum(inserts[,2])+cumsum(diff(s.pos1))))
s.pos2=c(s.pos1[-1]-1,s.end)
p.pos2=c((head(p.pos1,n=(length(p.pos1)-1))+diff(s.pos1)-1),p.end)

for(i in 1:nrow(dels)){
tmp=dels[i,]
p.pos1[which(p.pos1>tmp[1])]=p.pos1[which(p.pos1>tmp[1])]-tmp[2]
p.pos2[which(p.pos2>tmp[1])]=p.pos2[which(p.pos2>tmp[1])]-tmp[2]
}

p.pos2[length(p.pos2)]=p.end

tot=cbind(p.pos1,p.pos2,(p.pos2-p.pos1),s.pos1,s.pos2,(s.pos2-s.pos1))
colnames(tot)=c("PatStart","PatEnd","PatDist","SubStart","SubEnd","SubDist")
tot
}

example:

> tmp1="GTATCGATACGTCGATCGTCGATGCAGCAGCAGACTGATGCATGACTGC"

> tmp2="CACTCGATCGATGCTCTACATGATCGTGCAGCAACTGATGCATGACTGC"

> test=pairwiseAlignment(pattern=tmp1,subject=tmp2,type="global-local")

> matchTable(test)
     PatStart PatEnd PatDist SubStart SubEnd SubDist
[1,]        1     21      20        5     26      21
[2,]       23     29       6       27     33       6
[3,]       34     49      15       34     49      15

For my work I am aligning longer sequences (~1kb) and I care more about
where chunks of these sequences align rather than exactly which bases match
within these chunks. Since I'm comparing sequences from different species,
I'm trying to find where sequence from one species matches to the other,
allowing for insertions and deletions. The function could probably be
written better, but that is the hazard of letting biologists learn R.
Thanks again for your help.

writePairwiseAlignment output is attached

On Thu, Aug 22, 2013 at 4:38 PM, Valerie Obenchain <vobencha at fhcrc.org>wrote:

> Hi Brian,
>
> I don't know of a function that creates such a table but I think you can
> create this with the start() and end() accessors.
>
> pat <- DNAStringSet(c("CACCAGCT", "TTCGCC", "CGGTC"))
> sub <- DNAString("**ACTTCACCAGCTGACCTCG")
> res <- pairwiseAlignment(pat, sub)
>
> > res
> Global PairwiseAlignmentsSingleSubjec**t (1 of 3)
> pattern: [1] CACCAGCT
> subject: [5] CACCAGCT
> score: -48.14596
>
> DataFrame(subjectStart=start(**subject(res)),
>           subjectEnd=end(subject(res)),
>           patternStart=start(pattern(**res)),
>           patternEnd=end(pattern(res)))
>
> DataFrame with 3 rows and 4 columns
>   subjectStart subjectEnd patternStart patternEnd
>      <integer>  <integer>    <integer>  <integer>
> 1            5         12            1          8
> 2            3          8            1          6
> 3            1          5            1          5
>
>
> Valerie
>
>
> On 08/21/2013 08:40 AM, Brian Herb wrote:
>
>> Hi all,
>>
>> I feel like I'm missing some obvious function, but is there a simple way
>> to
>> output a table of matching start and end positions between the subject and
>> pattern in a pairwiseAlignment result (say from a global-local alignment)?
>> For example:
>>
>>    s.pos1 s.pos2 p.pos1 p.pos2
>>   [1,]    354    359      1      6
>>   [2,]    360    364      9     19
>>
>> where s.pos1 and s.pos2 are the start and end of a section of the subject
>> that corresponds to the start and end (p.pos1 and p.pos2)  of the pattern,
>> taking into account insertions and deletions. I know the insertion and
>> deletion information is easily available, but I haven't found a function
>> in
>> the literature that that summarizes them into a mummer like output of
>> corresponding start and end positions.
>>
>>
>
>
>

-- 
Brian Herb, Ph.D.
Johns Hopkins School of Medicine
Dr. Andrew Feinberg Laboratory
Rangos 580
855 N. Wolfe St.
Baltimore, MD 21205
Phone:410-614-3478
Fax: 410-614-9819
-------------- next part --------------
########################################
# Program: Biostrings (version 2.28.0), a Bioconductor package
# Rundate: Thu Aug 22 16:49:37 2013
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: P1
# 2: S1
# Matrix: NA
# Gap_penalty: 14.0
# Extend_penalty: 4.0
#
# Length: 50
# Identity:      36/50 (72.0%)
# Similarity:    NA/50 (NA%)
# Gaps:           6/50 (12.0%)
# Score: -29.85107
#
#
#=======================================

P1                 1 GTATCGATACG-TCGATCGTCGATGCAGCAGCAGACTGATGCATGACTGC     49
                       |||||| |  |  ||  ||| |||||||    ||||||||||||||||
S1                 5 CGATCGATGCTCTACATGATCG-TGCAGCA----ACTGATGCATGACTGC     49

#---------------------------------------
#---------------------------------------