[BioC] matchTable for pairwiseAlignment in Biostrings?
Valerie Obenchain
vobencha at fhcrc.org
Fri Aug 30 17:42:54 CEST 2013
This response from Herve didn't get posted - email troubles while traveling.
Valerie
-------- Original Message --------
Subject: Re: [BioC] matchTable for pairwiseAlignment in Biostrings?
Date: Thu, 22 Aug 2013 23:58:51 -0700
From: Hervé Pagès <hpages at fhcrc.org>
To: Brian Herb <brianherb at jhmi.edu>
CC: Valerie Obenchain <vobencha at fhcrc.org>, bioconductor at r-project.org
Hi Brian,
Global-local alignments can also be described in terms of position
and cigar (POS and CIGAR fields in the SAM/BAM formats).
In your example:
POS = 5
CIGAR = 11M1D10M1I7M4I16M (generated by hand, sorry)
The advantage of the POS + CIGAR representation is that one can make
use of the cigar utilities in GenomicRanges to implement something
equivalent to your matchTable() function:
library(GenomicRanges)
matchTable2 <- function(pos, cigar)
{
pat_ranges <- cigarRangesAlongQuerySpace(cigar, with.ops=TRUE)[[1]]
sub_ranges <- cigarRangesAlongReferenceSpace(cigar, pos=pos,
with.ops=TRUE)[[1]]
ops <- names(pat_ranges) # guaranteed to be the same as
names(sub_ranges)
I_idx <- which(ops == "I")
start_idx <- c(1L, I_idx + 1L)
end_idx <- c(I_idx - 1L, length(pat_ranges))
PatStart <- start(pat_ranges)[start_idx]
PatEnd <- end(pat_ranges)[end_idx]
SubStart <- start(sub_ranges)[start_idx]
SubEnd <- end(sub_ranges)[end_idx]
data.frame(PatStart, PatEnd, SubStart, SubEnd)
}
Then:
> matchTable2(5, "11M1D10M1I7M4I16M")
PatStart PatEnd SubStart SubEnd
1 1 21 5 26
2 23 29 27 33
3 34 49 34 49
So it sounds like it would maybe be helpful if we had a cigar() getter
that computes the cigars from a PairwiseAlignments object of type
"global-local". Once we have this, we could even support exporting
this kind of objects to the SAM/BAM format.
Cheers,
H.
On 08/22/2013 02:01 PM, Brian Herb wrote:
> Hi Valerie,
>
> Thank you for looking into this! I actually didn't see my post show up
> so I did a little more tinkering and came up with the following
> function, see what you think:
>
> (input is PairwiseAlignmentsSingleSubject)
>
> matchTable <- function(pwa) {
> p.start <- start(pattern(pwa))
> p.end <- end(pattern(pwa))
> s.start <- start(subject(pwa))
> s.end <- end(subject(pwa))
> inserts <- as.matrix(insertion(pwa)[[1]])
> dels <- as.matrix(deletion(pwa)[[1]])
> s.pos1 <- c(s.start,(s.start+inserts[,1]-1))
> p.pos1 <- c(p.start,(p.start + cumsum(inserts[,2])+cumsum(diff(s.pos1))))
> s.pos2=c(s.pos1[-1]-1,s.end)
> p.pos2=c((head(p.pos1,n=(length(p.pos1)-1))+diff(s.pos1)-1),p.end)
>
> for(i in 1:nrow(dels)){
> tmp=dels[i,]
> p.pos1[which(p.pos1>tmp[1])]=p.pos1[which(p.pos1>tmp[1])]-tmp[2]
> p.pos2[which(p.pos2>tmp[1])]=p.pos2[which(p.pos2>tmp[1])]-tmp[2]
> }
>
> p.pos2[length(p.pos2)]=p.end
>
> tot=cbind(p.pos1,p.pos2,(p.pos2-p.pos1),s.pos1,s.pos2,(s.pos2-s.pos1))
> colnames(tot)=c("PatStart","PatEnd","PatDist","SubStart","SubEnd","SubDist")
> tot
> }
>
> example:
>
> > tmp1="GTATCGATACGTCGATCGTCGATGCAGCAGCAGACTGATGCATGACTGC"
>
> > tmp2="CACTCGATCGATGCTCTACATGATCGTGCAGCAACTGATGCATGACTGC"
>
> > test=pairwiseAlignment(pattern=tmp1,subject=tmp2,type="global-local")
>
> > matchTable(test)
> PatStart PatEnd PatDist SubStart SubEnd SubDist
> [1,] 1 21 20 5 26 21
> [2,] 23 29 6 27 33 6
> [3,] 34 49 15 34 49 15
> For my work I am aligning longer sequences (~1kb) and I care more about
> where chunks of these sequences align rather than exactly which bases
> match within these chunks. Since I'm comparing sequences from different
> species, I'm trying to find where sequence from one species matches to
> the other, allowing for insertions and deletions. The function could
> probably be written better, but that is the hazard of letting biologists
> learn R. Thanks again for your help.
>
> writePairwiseAlignment output is attached
>
>
>
>
> On Thu, Aug 22, 2013 at 4:38 PM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
> Hi Brian,
>
> I don't know of a function that creates such a table but I think you
> can create this with the start() and end() accessors.
>
> pat <- DNAStringSet(c("CACCAGCT", "TTCGCC", "CGGTC"))
> sub <- DNAString("__ACTTCACCAGCTGACCTCG")
> res <- pairwiseAlignment(pat, sub)
>
> > res
> Global PairwiseAlignmentsSingleSubjec__t (1 of 3)
> pattern: [1] CACCAGCT
> subject: [5] CACCAGCT
> score: -48.14596
>
> DataFrame(subjectStart=start(__subject(res)),
> subjectEnd=end(subject(res)),
> patternStart=start(pattern(__res)),
> patternEnd=end(pattern(res)))
>
> DataFrame with 3 rows and 4 columns
> subjectStart subjectEnd patternStart patternEnd
> <integer> <integer> <integer> <integer>
> 1 5 12 1 8
> 2 3 8 1 6
> 3 1 5 1 5
>
>
> Valerie
>
>
> On 08/21/2013 08:40 AM, Brian Herb wrote:
>
> Hi all,
>
> I feel like I'm missing some obvious function, but is there a
> simple way to
> output a table of matching start and end positions between the
> subject and
> pattern in a pairwiseAlignment result (say from a global-local
> alignment)?
> For example:
>
> s.pos1 s.pos2 p.pos1 p.pos2
> [1,] 354 359 1 6
> [2,] 360 364 9 19
>
> where s.pos1 and s.pos2 are the start and end of a section of
> the subject
> that corresponds to the start and end (p.pos1 and p.pos2) of
> the pattern,
> taking into account insertions and deletions. I know the
> insertion and
> deletion information is easily available, but I haven't found a
> function in
> the literature that that summarizes them into a mummer like
> output of
> corresponding start and end positions.
>
>
>
>
>
>
>
> --
> Brian Herb, Ph.D.
> Johns Hopkins School of Medicine
> Dr. Andrew Feinberg Laboratory
> Rangos 580
> 855 N. Wolfe St.
> Baltimore, MD 21205
> Phone:410-614-3478
> Fax: 410-614-9819
More information about the Bioconductor
mailing list