[BioC] Biostrings DNAMultipleAlignment vs AlignedXStringSet

Fri Aug 31 17:10:50 CEST 2012

Hi, I have what I thought should be simple procedure... 

I have a multiple alignment. I want to find the locations (and sequences) of all the differences.

I can read the file into a DNAMultipleAlignment class:

aln = read.DNAMultipleAlignment(filepath=alnFiles[i], format="fasta")

This is a 4-species alignment, 300bp long. I finally figured out a way find all the "differences" (or, bases where the 4 species are not identical), like so:

which(apply(consensusMatrix(aln), 2, max) !=4)

It seems like there should be a more straightforward way, but this just looks at the count matrix and finds all the places that don't have all the species in the same nucleotide...

Now, I found no easy way to subset the aln object to pull out the sequences at those positions. Is there one?

I think maybe I want something like "mismatchTable", but according to the documentation, this won't work on a DNAMultipleAlignment... 

mismatchTable(aln)
Error in function (classes, fdef, mtable)  : 
  unable to find an inherited method for function "mismatchTable", for signature "DNAMultipleAlignment"

Instead, it works on a AlignedXStringSet. Well, I can't figure out the difference between an AlignedXStringSet and a DNAMultipleAlignment, which both appear to hold aligned strings, even XStringSets...and I cannot convert one to another.

Can anyone tell me where I should start to figure out the differences between these objects? And, how to solve my original problem of getting the the sequences at the locations that are different in an alignment?

Thanks for your help.

 -- output of sessionInfo(): 

R version 2.13.0 (2011-04-13)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] seqLogo_1.18.0      MotIV_1.6.0         rtracklayer_1.12.2 
[4] RCurl_1.6-4         bitops_1.0-4.1      GenomicRanges_1.4.3
[7] Biostrings_2.20.1   IRanges_1.10.3     

loaded via a namespace (and not attached):
[1] BSgenome_1.20.0 lattice_0.19-23 rGADEM_1.4.0    tools_2.13.0   
[5] XML_3.4-0      

--
Sent via the guest posting facility at bioconductor.org.