[Bioc-sig-seq] Biostrings: problem to access indel-details form pairwiseAlignment()

Wolfgang Raffelsberger wraff at igbmc.fr
Fri Jul 24 16:35:40 CEST 2009


Patrick,

thank for your quick response !
For a while things went well but now I got across a case where I have 
difficulty interpreting the number(s) given for start-positions of 
deletions. This is when there is a preceding insertion before a deletion 
occurs/ gets encountered.
Of course I could correct this, ie shift the values myself according to 
the number preceding inserted nucletides, but I posted this since I'm 
not sure if you're aware of this special case/problem.
Or maybe you have some other command/way to access (ie extract) the 
sequence(s) deleted ?

(Note that the equivalent approach for extracting insertions works 
without any problems :))

Wolfgang

 > suppressMessages(library(Biostrings))
 > mySubject <- ref1 <- DNAString("GGGATAGTGACTTCAGGATCCAGCTCTTCGCCTGGC")
 > myPattern <-  samp1 <- 
DNAStringSet(c("GGGATAGTGACTACAGGATCCAGCTCTTCGCCTGGC","ATAGTGACGTAGGATCACAGCTCTTCGCCGGC","TTGGGAGTGACTTGCAGGATCCAGTCTTCGCCTGGCAT")) 

 > ## 1st has a mutation;  2nd: {5'del+ mut(g)+ del(C)+ del(T)+ ins(A)} 
; 3rd: {5'ins + del(TA)+ ins(G)+ del(C)+ 3'ins}
 > ##  multiple alignment of test-case (gapOpening= -4, gapExtension= -2)
 > ##     GGGATAGTGACTT-CAGGATC-CAGCTCTTCGCCTGGC    .. ref = subject
 > ##     GGGATAGTGACTa-CAGGATC-CAGCTCTTCGCCTGGC    .. (1) mutation (T 
-> A)
 > ##        ATAGTGACgT--AGGATCACAGCTCTTCGCC-GGC    .. (2) 5'del + 
mut(g) + del(C)+ ins(A)+ del(T)
 > ##   ttGGGA--GTGACTTGCAGGATC-CAG-TCTTCGCCTGGCat  .. (3) 5'ins + 
del(TA)+ ins(G)+ del(C)+ 3'ins
 >
 > align <- pairwiseAlignment(myPattern,mySubject,gapOpening= -4, 
gapExtension= -2)
 >
 > nindel(align)                             # no of indels per seq
An object of class "InDel"
Slot "insertion":
    Length WidthSum
[1,]      0        0
[2,]      1        1
[3,]      1        1

Slot "deletion":
    Length WidthSum
[1,]      0        0
[2,]      2        2
[3,]      2        3

 > (delStart <- start(indel(pattern(align))))
[1] 11 30  5 23
 > (delEnd <- end(indel(pattern(align))))
[1] 11 30  6 23
 >
 > (seqNo_wDel <- rep.int(1:length(align), 
elementLengths(indel(pattern(align)))))  # sequence-number for each 
instance of deletion
[1] 2 2 3 3
 >
 > for(i in 1:length(seqNo_wDel)) 
print(substr(as.character(unaligned(subject(align[seqNo_wDel[i]]))),delStart[i],delEnd[i])) 

[1] "C"
[1] "G"
[1] "TA"
[1] "G"
## however the "true" deleted ones are : C, T, TA, C   => problem with 
deletions after preceding insertion (ie ech 2nd case)
 > for(i in 1:length(seqNo_wDel)) print(substr(as.character(  
aligned(pattern(align[seqNo_wDel[i]]))),delStart[i],delEnd[i]))
[1] "-"
[1] "C"
[1] "--"
[1] "A"
 > ## ==> problem with pôsition of deletions after preceding insertion 
(ie ech 2nd case)
 >
 > sessionInfo()
R version 2.10.0 Under development (unstable) (2009-07-22 r48979)
x86_64-unknown-linux-gnu

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base   
other attached packages:
[1] Biostrings_2.13.27 IRanges_1.3.42  
loaded via a namespace (and not attached):
[1] Biobase_2.5.4
 >




Patrick Aboyoun a écrit :
> Wolfgang,
> The IRanges infrastructure has evolved greatly from BioC 2.4/R-2.9 to 
> BioC 2.5/R-2.10 and I highly recommend you use accessor functions, 
> rather than direct slot access to obtain the information you are 
> looking for. For example, the IRangesList class definition has changed 
> greatly and the @elements slot is no longer present.
>
> Question: how to transform an IRangesList object to a list of IRanges 
> object
> Answer:
> Short answer, use as.list as in as.list(indel(subject(align))). Long 
> answer, this is typically not necessary and not desired given that 
> IRangesList has many methods defined for them. See man pages for 
> IRangesList, RangesList, and Sequence (if you are on BioC 2.5/R-2.10). 
> If you can tell me what operations you would like to perform, I can 
> point you in the right direction.
>
> Question: how to the the start, width, and end locations of a pairwise 
> alignment.
> Answer: use the start, width and end methods.
> > start(subject(align))
> [1] 1 1 4
> > width(subject(align))
> [1] 24 24 19
> > end(subject(align))
> [1] 24 24 22
>
>
> Patrick
>
>
> Wolfgang Raffelsberger wrote:
>> Patrick,
>>
>> thank you very much for your quick and helpful answer !
>>
>> Yes, using  :
>> > align <- pairwiseAlignment(samp1,ref1)
>> > indel(subject(align))
>>
>> I'm about to get what I'm looking for.  Now my question is, which 
>> commands are (will be) availabel for mining an IRangesList-object.
>> Most of all I'm interested in what would correspond to getting :
>>
>> > indel(subject(align))@elements
>> > subject(align)@range at start
>> > subject(align)@range at witdth   # in fact, so far I can do without 
>> this one
>>
>> (unless you think the @elements, and  @range won't change in the 
>> future ...)
>> With these elements I manage now to extract the very nucleotides that 
>> were inserted/deleted.
>>
>> Wolfgang
>>
>>
>> Patrick Aboyoun a écrit :
>>> Wolfgang,
>>> Below is code that retrieves the indel locations you are looking 
>>> for. I like your attempts at using indel, insertion, and deletion 
>>> for PairwiseAlignment objects and I'll add the methods for 
>>> PairwiseAlignment objects to BioC 2.5 (devel) shortly using the 
>>> conventions that I specify below.
>>>
>>> > suppressMessages(library(Biostrings))
>>> > ref1 <- DNAString("GGGATACTTCACCAGCTCCCTGGC") # my pattern
>>> > samp1 <- 
>>> DNAStringSet(c("GGGATACTACACCAGCTCCCTGGC","GGGATACTTACACCAGCTCCCTGGC","ATACTTCACCAGCTCCCTG")) 
>>>
>>> > # 1st has a mutation, 2nd has an insertion, the 3rd is simply 
>>> shorter ...
>>> >
>>> > align <- pairwiseAlignment(samp1,ref1)
>>> >
>>> > nindel(align)
>>> An object of class “InDel”
>>> Slot "insertion":
>>> Length WidthSum
>>> [1,] 0 0
>>> [2,] 1 1
>>> [3,] 0 0
>>>
>>> Slot "deletion":
>>> Length WidthSum
>>> [1,] 0 0
>>> [2,] 0 0
>>> [3,] 0 0
>>>
>>> > deletions <- indel(pattern(align))
>>> > deletions
>>> CompressedIRangesList: 3 elements
>>> > insertions <- indel(subject(align))
>>> > insertions
>>> CompressedIRangesList: 3 elements
>>> > insertions[[2]]
>>> IRanges instance:
>>> start end width
>>> [1] 10 10 1
>>> > sessionInfo()
>>> R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
>>> i386-apple-darwin9.7.0
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.13.26 IRanges_1.3.41
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.5.4
>>>
>>>
>>> Wolfgang Raffelsberger wrote:
>>>> Dear list,
>>>>
>>>> previously I've been extracting indel-information from sequences 
>>>> aligned by the Biostrings function pairwiseAlignment(), which is 
>>>> probably not the best way since the class 
>>>> 'PairwiseAlignedFixedSubject' has evoled & changed and my old code 
>>>> won't work any more. Now trying to use the library-provided 
>>>> functions to access the information/details about indels (ie their 
>>>> localization on the pattern and possibly the indel sequence ). 
>>>> However, I can't find a function to extract this information, that 
>>>> is (to the best of my knowledge) part of the aligned object.
>>>>
>>>> ## here an example :
>>>> library(Biostrings)
>>>> ref1 <- DNAString("GGGATACTTCACCAGCTCCCTGGC") # my pattern
>>>> samp1 <- 
>>>> DNAStringSet(c("GGGATACTACACCAGCTCCCTGGC","GGGATACTTACACCAGCTCCCTGGC","ATACTTCACCAGCTCCCTG")) 
>>>>
>>>> # 1st has a mutation, 2nd has an insertion, the 3rd is simply 
>>>> shorter ...
>>>>
>>>> align <- pairwiseAlignment(samp1,ref1)
>>>>
>>>> nindel(align) # insertion was found properly but I can't see at 
>>>> which nt position the indel was found (neither if it's an insertion 
>>>> or deletion)
>>>> indel(align) # Error in function (classes, fdef, mtable) unable to 
>>>> find an inherited method for function...
>>>> insertion(align) # Error in function (classes, fdef, mtable) unable 
>>>> to find an inherited method for function ...
>>>> deletion(align) # neither ...
>>>> ?AlignedXStringSet # says under 'Accessor methods' that indel() 
>>>> exists ..
>>>>
>>>> ## ideally I'd be looking for something like
>>>> mismatchTable(align) # but addressing indels ...
>>>>
>>>>
>>>> ## for completeness :
>>>> > sessionInfo()
>>>> R version 2.9.1 (2009-06-26)
>>>> i386-pc-mingw32
>>>>
>>>> locale:
>>>> LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252 
>>>>
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>> other attached packages:
>>>> [1] ShortRead_1.2.1 lattice_0.17-25 BSgenome_1.12.3 
>>>> Biostrings_2.12.7 IRanges_1.2.3
>>>> loaded via a namespace (and not attached):
>>>> [1] Biobase_2.4.1 grid_2.9.1 hwriter_1.1
>>>>
>>>> Thank's in advance,
>>>> Wolfgang Raffelsberger
>>>>
>>>>
>

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Wolfgang Raffelsberger, PhD
Laboratoire de BioInformatique et Génomique Intégratives
CNRS UMR7104, IGBMC,  
1 rue Laurent Fries,  67404 Illkirch  Strasbourg,  France
Tel (+33) 388 65 3300         Fax (+33) 388 65 3276
wolfgang.raffelsberger (at) igbmc.fr



More information about the Bioc-sig-sequencing mailing list