[BioC] Longest continuous sequence from multiple alignment
Tomas Bjorklund [guest]
guest at bioconductor.org
Mon May 5 19:02:34 CEST 2014
I have a number of ShortRead sequences aligned and stored in a DNAMultipleAlignment object. Iâm trying to find a way to extract the common longest sequence between the reads. i.e., it is a combination of consensus sequence and longest read
here is an example of 6 reads that I have aligned:
[1] "CTATGGGAGGGAAAAAGGATCCAGTATTAGGGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAACA-------------------------------------------------------------"
[2] "CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAACAACATACTTTCACTTGGC-AAAAAAAACATGGTGACTGCAAAGAG-----------------"
[3] "CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAACAACATACTTTCACTTGGC--------------------------------------------"
[4] "CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAAC--------------------------------------------------------------"
[5] "CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAACAACATACTTTCACTTGGCAAAAAAAAACATGGTGACTGCAAAGAGAGCTCTGTCTGGCTTCTâ
[6] "CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAAC--------------------------------------------------------------"
>From these, I would like to go extract the longest common sequence, but retain ambiguity information where available i.e producing the following sequence:
"CTATGGGAGGG-AAAAGGATCCAGTATTA-GGGGAGGCAGGAAGTGCATGTGAGGCCACAGTGTCAACAACATACTTTCACTTGGC-AAAAAAAACATGGTGACTGCAAAGAGAGCTCTGTCTGGCTTCTâ
The consensusString function does not work well for me as it truncates the sequence in the 3â end in this example.
Thank you for your help
All the best
Tomas
-- output of sessionInfo():
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rsubread_1.14.0 Rlibstree_0.3-2 BiocInstaller_1.14.2 muscle_3.8.31-1 ShortRead_1.22.0 GenomicAlignments_1.0.1
[7] BSgenome_1.32.0 Rsamtools_1.16.0 GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 Biostrings_2.32.0 XVector_0.4.0
[13] IRanges_1.22.6 BiocParallel_0.6.0 BiocGenerics_0.10.0
loaded via a namespace (and not attached):
[1] BatchJobs_1.2 BBmisc_1.6 Biobase_2.24.0 bitops_1.0-6 brew_1.0-6 codetools_0.2-8 DBI_0.2-7
[8] digest_0.6.4 fail_1.2 foreach_1.4.2 grid_3.1.0 hwriter_1.3 iterators_1.0.7 lattice_0.20-29
[15] latticeExtra_0.6-26 plyr_1.8.1 RColorBrewer_1.0-5 Rcpp_0.11.1 RSQLite_0.11.4 sendmailR_1.1-2 stats4_3.1.0
[22] stringr_0.6.2 tools_3.1.0 zlibbioc_1.10.0
--
Sent via the guest posting facility at bioconductor.org.
More information about the Bioconductor
mailing list