[BioC] Longest continuous sequence from multiple alignment

Mon May 5 19:02:34 CEST 2014

I have a number of ShortRead sequences aligned and stored in a DNAMultipleAlignment object. Iâ€™m trying to find a way to extract the common longest sequence between the reads. i.e., it is a combination of consensus sequence and longest read

here is an example of 6 reads that I have aligned:

[1] "CTATGGGAGGGAAAAAGGATCCAGTATTAGGGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAACA-------------------------------------------------------------"
[2] "CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAACAACATACTTTCACTTGGC-AAAAAAAACATGGTGACTGCAAAGAG-----------------"
[3] "CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAACAACATACTTTCACTTGGC--------------------------------------------"
[4] "CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAAC--------------------------------------------------------------"
[5] "CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAACAACATACTTTCACTTGGCAAAAAAAAACATGGTGACTGCAAAGAGAGCTCTGTCTGGCTTCTâ€
[6] "CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAAC--------------------------------------------------------------"

>From these, I would like to go extract the longest common sequence, but retain ambiguity information where available i.e producing the following sequence: 

 "CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAACAACATACTTTCACTTGGC-AAAAAAAACATGGTGACTGCAAAGAGAGCTCTGTCTGGCTTCTâ€

The consensusString function does not work well for me as it truncates the sequence in the 3â€™ end in this example. 

Thank you for your help

All the best

Tomas

 -- output of sessionInfo(): 

R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Rsubread_1.14.0         Rlibstree_0.3-2         BiocInstaller_1.14.2    muscle_3.8.31-1         ShortRead_1.22.0        GenomicAlignments_1.0.1
 [7] BSgenome_1.32.0         Rsamtools_1.16.0        GenomicRanges_1.16.3    GenomeInfoDb_1.0.2      Biostrings_2.32.0       XVector_0.4.0          
[13] IRanges_1.22.6          BiocParallel_0.6.0      BiocGenerics_0.10.0    

loaded via a namespace (and not attached):
 [1] BatchJobs_1.2       BBmisc_1.6          Biobase_2.24.0      bitops_1.0-6        brew_1.0-6          codetools_0.2-8     DBI_0.2-7          
 [8] digest_0.6.4        fail_1.2            foreach_1.4.2       grid_3.1.0          hwriter_1.3         iterators_1.0.7     lattice_0.20-29    
[15] latticeExtra_0.6-26 plyr_1.8.1          RColorBrewer_1.0-5  Rcpp_0.11.1         RSQLite_0.11.4      sendmailR_1.1-2     stats4_3.1.0       
[22] stringr_0.6.2       tools_3.1.0         zlibbioc_1.10.0    

--
Sent via the guest posting facility at bioconductor.org.