[R-sig-phylo] DNA sequence management for phylogenetics in R

Brian O'Meara bcomeara at nescent.org
Tue Mar 17 17:46:37 CET 2009

Christoph Heibl has some R code that calls mafft for alignment (which
I currently like better than Clustal, btw) and other code that can
interact with a PostgreSQL database for storing info [according to the
software description -- I haven't tried this]. See <http://www.christophheibl.de/Rpackages.html>


On Mar 17, 2009, at 12:09 PM, Emmanuel Paradis wrote:

> Dan,
> It seems that the way DNA sequences are coded in ape with the class  
> "DNAbin" meets some of the criteria you list below. Sequences are  
> stored in vectors, lists of vectors, or matrices. The usual methods  
> for extracting and subsetting ([, [[, $) have been written for this  
> class. There are also methods for rbind and cbind. I have modified
> them recently so that "super-matrices" can be built, filling some
> columns/rows with gaps where necessary.
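A minimal sketch of the super-matrix idea described above, assuming a current version of ape in which cbind() for "DNAbin" matrices accepts a fill.with.gaps argument (the argument name in the 2009 release may have differed). The taxon names and sequences are made up for illustration.

```r
library(ape)

# Two hypothetical aligned loci; taxonC is missing from the first one,
# taxonA from the second.
locus1 <- as.DNAbin(rbind(taxonA = c("a", "c", "g", "t"),
                          taxonB = c("a", "c", "g", "a")))
locus2 <- as.DNAbin(rbind(taxonB = c("t", "t", "g"),
                          taxonC = c("t", "a", "g")))

# Rows are matched by name; taxa absent from a locus are padded with gaps.
supermatrix <- cbind(locus1, locus2, fill.with.gaps = TRUE)

dim(supermatrix)           # taxa x sites
as.character(supermatrix)  # gaps appear as "-"
```

Without fill.with.gaps = TRUE, only the taxa present in every locus would be kept.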
> There is no way to do sequence alignment directly in R at the
> moment, but Clustal can be called with the system() function and
> read.dna() can read Clustal alignment files, so this can be scripted
> easily.
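One way this scripting might look, as a sketch only. It assumes a ClustalW 2 executable on the search path under the name "clustalw2" (the name varies between installations) and uses the standard -INFILE, -OUTFILE and -OUTPUT options. The helper names clustal_command() and align_with_clustal() are hypothetical, not part of ape.

```r
library(ape)

# Build the ClustalW command line. The executable name is an assumption.
clustal_command <- function(infile, outfile, exe = "clustalw2") {
  paste(exe,
        paste0("-INFILE=", shQuote(infile)),
        paste0("-OUTFILE=", shQuote(outfile)),
        "-OUTPUT=CLUSTAL")
}

# Write unaligned sequences, align them externally, read the result back.
align_with_clustal <- function(x, exe = "clustalw2") {
  if (!nzchar(Sys.which(exe))) stop("Clustal executable not found: ", exe)
  infile  <- tempfile(fileext = ".fasta")
  outfile <- tempfile(fileext = ".aln")
  write.dna(x, infile, format = "fasta")
  status <- system(clustal_command(infile, outfile, exe))
  if (status != 0) stop("Clustal exited with status ", status)
  read.dna(outfile, format = "clustal")
}
```

With a hypothetical set of unaligned "DNAbin" sequences in my_seqs, the call would be aln <- align_with_clustal(my_seqs).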
> About seqinr: I found it very useful for various things. I used it  
> recently to translate DNA into AA sequences. It can also return the  
> complement of a DNA sequence (ape cannot for the moment).
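For reference, a short sketch of the two seqinr operations mentioned above, using a made-up nine-base sequence. s2c(), translate() and comp() are seqinr functions; translate() is called here with its defaults (standard genetic code, first reading frame).

```r
library(seqinr)

dna <- s2c("atggcctaa")   # split a string into a vector of single characters

protein <- translate(dna) # DNA to amino acids, one letter per codon
complement <- comp(dna)   # complement, base by base
revcomp <- rev(complement) # reverse complement
```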
> EP
> On 06.03.2009 16:26, Dan Rabosky wrote:
>> Howdy-
>> Maybe this is the wrong approach altogether, but I have been using
>> R to manage large sets of DNA sequences, to generate input files
>> for various phylogeny programs, to keep track of output from
>> large sequencing projects, and so on. I have written a bit of code
>> for this, but it would be nice to get thoughts/input from others
>> on what an ideal approach might look like.
>> I can see where my current approach is probably going to hit some
>> walls. It also seems to me that this could be useful for many
>> folks. So, before I redesign my own system, perhaps it would be
>> nice to collectively come up with something useful in this area. I
>> would be very interested in any brainstorming, suggestions, or
>> thoughts on what has and has not worked for you, if you have been
>> wrestling with something along these lines. Disclaimer: I haven't
>> used the seqinr package (just noticed it), so I would also be
>> interested in opinions on whether people find this to be an
>> adequate base package.
>> As a starting point, I'm assuming that you are working with
>> cleaned sequences, that you have thousands of individual sequences
>> from many loci, potentially thousands of individuals - each of
>> which may have been sequenced for some, but not necessarily all, loci.
>> As I've done this so far, I store all individual sequences in a
>> general class of object which contains the relevant taxonomic
>> info, GenBank ID (if available), and other info. I can then feed in
>> a character vector of individual IDs plus a second vector of locus
>> names, and generate a range of files in the appropriate
>> format for various phylogeny software. If taxa cannot be found,
>> they receive the relevant missing data code in that block of the
>> input file.
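The lookup-and-pad step described above could be sketched in base R as follows. This is not Dan's code: the storage layout (one named character vector of aligned sequences per locus), the function names build_matrix() and write_phylip(), and all of the data are hypothetical.

```r
# Hypothetical store: one named vector of aligned sequences per locus.
seq_store <- list(
  cytb = c(ind1 = "ACGTAC", ind2 = "ACGTAA"),
  rag1 = c(ind2 = "GGTT",   ind3 = "GGTA")
)

# Concatenate the requested loci for the requested individuals, padding
# with the missing-data symbol wherever an individual lacks a locus.
build_matrix <- function(store, ids, loci, missing = "?") {
  blocks <- lapply(loci, function(locus) {
    seqs  <- store[[locus]]
    width <- nchar(seqs[[1]])         # aligned, so all the same length
    block <- unname(seqs[ids])        # NA for individuals not found
    block[is.na(block)] <- strrep(missing, width)
    block
  })
  out <- do.call(paste0, unname(blocks))
  names(out) <- ids
  out
}

# Write a simple sequential PHYLIP-style file (names are not truncated
# or validated in this sketch).
write_phylip <- function(mat, file) {
  header <- paste(length(mat), nchar(mat[[1]]))
  writeLines(c(header, paste(names(mat), mat)), file)
}

m <- build_matrix(seq_store, ids = c("ind1", "ind2", "ind3"),
                  loci = c("cytb", "rag1"))
```

Writers for other formats (NEXUS, FASTA) would take the same named vector and differ only in the header and line layout.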
>> Anyway, ideally, one would want:
>> (1) the ability to seamlessly integrate new sequences into the
>> existing database from a variety of sources (new user-generated
>> sequences, large FASTA files from GenBank, etc).
>> (2) Some way of dealing with alignments. Suppose I have 500
>> individuals for some particular locus; everything is aligned and
>> the aligned seqs go into the database. But if I add more
>> individuals, I need to re-align everything with the new
>> individuals and update the database accordingly. I think it might
>> be useful to keep track of three pieces of information here: (a)
>> the raw sequence, (b) the aligned sequence, and (c) the 'version'
>> or date of alignment. Thus, if you added new sequences to the
>> database or merged several databases, you could check item (c) to
>> see if they were of the same alignment version. If not, then you
>> export everything for alignment, align, import, and update the
>> database accordingly.
>> (3) A manner of storing the actual data that has some
>> transparency. Meaning that if I create some bizarre file format
>> for storing all of this info, it might be really difficult for
>> someone other than myself to interpret. As I do it now, I use an
>> XML-type format that is generally unintelligible. Because others
>> who do not use R may wish to access these data the hard way, this
>> is problematic.
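The bookkeeping in item (2), keeping the raw sequence, the aligned sequence and an alignment version side by side, might be sketched like this in base R. The column names, the helper functions needs_realignment() and update_alignment(), and the data are all hypothetical.

```r
# Hypothetical table: one row per individual per locus.
seq_db <- data.frame(
  id            = c("ind1", "ind2", "ind3"),
  locus         = c("cytb", "cytb", "cytb"),
  raw_seq       = c("ACGTAC", "ACGAC", "ACGTAA"),
  aligned_seq   = c("ACGTAC", "ACG-AC", NA),
  align_version = c("2009-03-01", "2009-03-01", NA),
  stringsAsFactors = FALSE
)

# A locus needs re-aligning if any record is unaligned or if its records
# do not all share one alignment version.
needs_realignment <- function(db, locus) {
  v <- db$align_version[db$locus == locus]
  anyNA(v) || length(unique(v)) > 1
}

# After an external alignment, store the aligned sequences under a new
# version. `aligned` is a character vector named by individual ID.
update_alignment <- function(db, locus, aligned,
                             version = as.character(Sys.Date())) {
  rows <- which(db$locus == locus)
  db$aligned_seq[rows]   <- unname(aligned[db$id[rows]])
  db$align_version[rows] <- version
  db
}
```

Here ind3 was added after the last alignment, so needs_realignment(seq_db, "cytb") is TRUE until update_alignment() has been run with a fresh alignment of all three sequences.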
>> Perhaps I would be better off with SQL, or maybe these tools are
>> so obviously available in Perl, BioPerl, etc. that it will seem odd
>> that I even attempted this. If so, I'd like to be pointed in the
>> right direction!
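On the SQL option, a minimal sketch of what such storage could look like from R. It assumes the DBI and RSQLite packages, neither of which is mentioned in this thread; SQLite stands in here for whichever SQL backend one prefers. The table layout and data are hypothetical.

```r
library(DBI)

# Hypothetical records: one row per individual per locus.
records <- data.frame(
  id      = c("ind1", "ind2", "ind2"),
  taxon   = c("Genus alpha", "Genus beta", "Genus beta"),
  locus   = c("cytb", "cytb", "rag1"),
  raw_seq = c("ACGTAC", "ACGAC", "GGTT"),
  stringsAsFactors = FALSE
)

# An in-memory database for illustration; give a file path to persist it.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "sequences", records)

# Pull every sequence for one locus with a parameterised query.
cytb <- dbGetQuery(con,
  "SELECT id, raw_seq FROM sequences WHERE locus = ? ORDER BY id",
  params = list("cytb"))

dbDisconnect(con)
```

A table like this can be read by any SQL client, which addresses the concern in item (3) about people who do not use R.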
>> Thanks!
>> ~Dan
>> Dan Rabosky
>> Department of Ecology and Evolutionary Biology &
>> Fuller Evolutionary Biology Program
>> Cornell Lab of Ornithology
>> Cornell University
>> Ithaca, NY 14853-2701
>> new website:
>> http://www.eeb.cornell.edu/Rabosky/dan/main.html
>> _______________________________________________
>> R-sig-phylo mailing list
>> R-sig-phylo at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> -- 
> Emmanuel Paradis
> IRD, Montpellier, France
>  ph: +33 (0)4 67 16 64 47
> fax: +33 (0)4 67 16 64 40
> http://ape.mpl.ird.fr/
