[BioC] Biostrings DNAMultipleAlignment vs AlignedXStringSet
Martin Morgan
mtmorgan at fhcrc.org
Fri Aug 31 21:26:28 CEST 2012
Hi Nathan --
On 08/31/2012 08:10 AM, Nathan Sheffield [guest] wrote:
>
> Hi, I have what I thought should be simple procedure...
>
> I have a multiple alignment. I want to find the locations (and
> sequences) of all the differences.
>
> I can read the file into a DNAMultipleAlignment class:
>
> aln = read.DNAMultipleAlignment(filepath=alnFiles[i],
> format="fasta")
>
> This is a 4-species alignment, 300bp long. I finally figured out a
> way find all the "differences" (or, bases where the 4 species are not
> identical), like so:
>
> which(apply(consensusMatrix(aln), 2, max) !=4)
>
> It seems like there should be a more straightforward way, but this
> just looks at the count matrix and finds all the places that don't
> have all the species in the same nucleotide...
For a reproducible example, I grabbed
aln <-
readDNAMultipleAlignment(filepath =
system.file("extdata",
"msx2_mRNA.aln",
package="Biostrings"),
format="clustal")
from the help page, ?MultipleAlignment. This has 8 rows.
I did
cidx = colSums(consensusMatrix(aln) == nrow(aln)) == 1
to get a logical vector, but same idea as your 'apply' solution. I ended
up with
> table(cidx)
cidx
FALSE TRUE
2024 319
so 2024 columns that are not consistent across all rows.
> Now, I found no easy way to subset the aln object to pull out the
> sequences at those positions. Is there one?
I applied a mask to the columns
colmask(aln, "replace") = cidx
and then coerced my aln to a DNAStringSet
> as(aln, "DNAStringSet")
A DNAStringSet instance of length 8
width seq names
[1] 2024 -----TCCCGTCTCCGCA...TAAAAAAAAAAAAAAAAA gi|84452153|ref|N...
[2] 2024 ------------------...------------------ gi|208431713|ref|...
[3] 2024 ------------------...------------------ gi|118601823|ref|...
[4] 2024 ------------------...------------------ gi|114326503|ref|...
[5] 2024 ------------------...------------------ gi|119220589|ref|...
[6] 2024 ------------------...------------------ gi|148540149|ref|...
[7] 2024 --------------CGGC...------------------ gi|45383056|ref|N...
[8] 2024 GGGGGAGACTTCAGAAGT...------------------ gi|213515133|ref|...
Also, I coerced the logical vector to an IRanges
> (crng = as(!cidx, "IRanges"))
IRanges of length 166
start end width
[1] 1 98 98
[2] 100 109 10
[3] 111 120 10
[4] 122 122 1
[5] 124 127 4
[6] 129 142 14
[7] 144 155 12
[8] 157 159 3
[9] 161 166 6
... ... ... ...
[158] 876 876 1
[159] 878 882 5
[160] 885 886 2
[161] 888 889 2
[162] 892 892 1
[163] 894 896 3
[ reached getOption("max.print") -- omitted 3 rows ]
This allowed me to look at one range of non-consensus at a time, e.g.,
> colmask(aln, "replace", invert=TRUE) = crng[1]
> as(aln, "DNAStringSet")
A DNAStringSet instance of length 8
width seq names
[1] 98 -----TCCCGTCTCCGCA...GGGCAGAGAAGTCA-TGG gi|84452153|ref|N...
[2] 98 ------------------...-------------A-TGG gi|208431713|ref|...
[3] 98 ------------------...---GAGAGAAGTCA-TGG gi|118601823|ref|...
[4] 98 ------------------...GCGCAGA-AAGTCA-TGG gi|114326503|ref|...
[5] 98 ------------------...-------------A-TGG gi|119220589|ref|...
[6] 98 ------------------...-------------A-TGG gi|148540149|ref|...
[7] 98 --------------CGGC...GGGCGGCCCCGCTC-CAG gi|45383056|ref|N...
[8] 98 GGGGGAGACTTCAGAAGT...GAATGTGTTCGTCAACAT gi|213515133|ref|...
I haven't worked with these much, so I might not be doing things in the
most efficient way; mostly I'm going from ?MultipleAlignment; there is
also vignette(package="Biostrings", "MultipleAlignments").
>
> I think maybe I want something like "mismatchTable", but according to
> the documentation, this won't work on a DNAMultipleAlignment...
>
> mismatchTable(aln) Error in function (classes, fdef, mtable) :
> unable to find an inherited method for function "mismatchTable", for
> signature "DNAMultipleAlignment"
>
> Instead, it works on a AlignedXStringSet. Well, I can't figure out
> the difference between an AlignedXStringSet and a
> DNAMultipleAlignment, which both appear to hold aligned strings, even
> XStringSets...and I cannot convert one to another.
>
> Can anyone tell me where I should start to figure out the differences
> between these objects? And, how to solve my original problem of
> getting the the sequences at the locations that are different in an
> alignment?
I think the AlignedXStringSet is strictly for pairwise alignments
returned by the ?pairwiseAlignment function.
Hope that helps,
Martin
>
>
> Thanks for your help.
>
>
>
> -- output of sessionInfo():
>
> R version 2.13.0 (2011-04-13) Platform: x86_64-redhat-linux-gnu
> (64-bit)
>
> locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3]
> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C
> LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9]
> LC_ADDRESS=C LC_TELEPHONE=C [11]
> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages: [1] grid stats graphics grDevices
> utils datasets methods [8] base
>
> other attached packages: [1] seqLogo_1.18.0 MotIV_1.6.0
> rtracklayer_1.12.2 [4] RCurl_1.6-4 bitops_1.0-4.1
> GenomicRanges_1.4.3 [7] Biostrings_2.20.1 IRanges_1.10.3
>
> loaded via a namespace (and not attached): [1] BSgenome_1.20.0
> lattice_0.19-23 rGADEM_1.4.0 tools_2.13.0 [5] XML_3.4-0
>
>
> -- Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________ Bioconductor mailing
> list Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
> archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list