[BioC] Biostring: print sequence alignment to file

Martin Preusse martin.preusse at googlemail.com
Sat Apr 21 11:55:25 CEST 2012


Hi Hervé,  

thanks for your help! If you need suggestions, help or testing, just say the word.

Will you implement the header also? If you do so, I would be thankful for an option like "header=F" for the output.


Cheers
Martin


Am Samstag, 21. April 2012 um 02:12 schrieb Hervé Pagès:

> Thanks Martin and Thomas for the useful feedback. The 'pair' and
> 'markx0' formats supported by Emboss seem indeed appropriate for
> printing the output of pairwiseAlignment() to a file. I'll add
> support for those 2 formats in Biostrings. Won't be before 1 week
> or 2 though...
>  
> Cheers,
> H.
>  
> On 04/18/2012 03:20 AM, Martin Preusse wrote:
> > Hi,
> >  
> > I just found this function to print a pairwise alignments in blocks. Doesn't add the match/mismatch indicators between sequences, but might be a starting point:
> >  
> > http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html#viewing-a-long-pairwise-alignment
> >  
> >  
> > Cheers
> > Martin
> >  
> >  
> >  
> > Am Mittwoch, 18. April 2012 um 12:16 schrieb Martin Preusse:
> >  
> > > Hi everybody,
> > >  
> > > I think the output format depends on the purpose of the alignment.
> > >  
> > > A pairwise sequence alignment is usually done to compare two sequences base by base. In my case, I compare sequencing results of cloned expression constructs with the desired sequence. Thus, the best output format would be "BLAST like".
> > >  
> > > seq1: 1 ATCTGC 7
> > > | | | . . |
> > > seq2: 1 ATCAAC 7
> > >  
> > > When doing MSA, most people might rather be interested in the consensus sequence. E.g. in the context of conservation between species.
> > >  
> > > So write.PairwiseAlignedXStringSet() and write.MultipleAlignment() are quite different and BLAST doesn't make much sense for multiple alignments. This means it would be best to put the output in the PairwiseAlignment/MultipleAlignment and not to the XStringSet, right?
> > >  
> > > This is an overview of sequence alignment formats used by EMBOSS:
> > > http://emboss.sourceforge.net/docs/themes/AlignFormats.html
> > >  
> > > 'pair' or 'markx0' would be perfectly fine.
> > >  
> > >  
> > > Cheers
> > > Martin
> > >  
> > >  
> > >  
> > > Am Dienstag, 17. April 2012 um 22:13 schrieb Thomas Girke:
> > >  
> > > > Hi Hervé,
> > > >  
> > > > To me, the most basic and versatile MSA or pairwise alignment format to output
> > > > to would be FASTA since it is compatible with almost any other alignment
> > > > editing software. For text-based viewing purposes my preference would be
> > > > to also output to a format similar to the one shown in the following
> > > > example. When there are only two sequences then one could show instead
> > > > of a consensus line the pipe characters between the two sequences to
> > > > indicate identical residues which mimics the blast output. A more
> > > > standardized version of this pairwise alignment format can be found
> > > > here:
> > > > http://emboss.sourceforge.net/apps/cvs/emboss/apps/needle.html
> > > >  
> > > > library(Biostrings)
> > > > p450<- read.AAStringSet("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/p450.mul", "fasta")
> > > >  
> > > > StringSet2html<- function(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=TRUE, ...) {
> > > > if(class(msa)=="AAStringSet") msa<- AAStringSet(msa, start=start, end=end)
> > > > if(class(msa)=="DNAStringSet") msa<- DNAStringSet(msa, start=start, end=end)
> > > > msavec<- sapply(msa, toString)
> > > > offset<- (counter-1)-nchar(nchar(msavec[1]))
> > > > legend<- paste(paste(paste(paste(rep(" ", offset), collapse=""), format(seq(0,
> > > > nchar(msavec[1]), by=counter)[-1])), collapse=""), collapse="")
> > > > consensus<- consensusString(msavec, ambiguityMap=".", ...)
> > > > msavec<- paste(msavec, rowSums(as.matrix(msa) != "-"), sep=" ")
> > > > msavec<- paste(format(c("", names(msa), "Consensus"), justify="left"), c(legend, msavec,
> > > > consensus), sep=" ")
> > > > msavec<- c("<html><pre>", msavec,"</pre></html>")
> > > > writeLines(msavec, file)
> > > > if(browser==TRUE) { browseURL(file) }
> > > > }
> > > > StringSet2html(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=T, threshold=1.0)
> > > > StringSet2html(msa=p450, file="p450.html", start=450, end=470, counter=20, browser=T, threshold=1.0)
> > > >  
> > > >  
> > > > Thomas
> > > >  
> > > > On Tue, Apr 17, 2012 at 07:43:30PM +0000, Hervé Pagès wrote:
> > > > > Hi Thomas,
> > > > >  
> > > > > On 04/17/2012 11:49 AM, Thomas Girke wrote:
> > > > > > What about providing an option in pairwiseAlignment to output to the
> > > > > > MultipleAlignment class in Biostrings and then write the latter to
> > > > > > different alignment formats?
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > > Or we could provide coercion methods to switch between
> > > > > PairwiseAlignedXStringSet and MultipleAlignment.
> > > > >  
> > > > > Anyway that kind of moves Martin's problem from having a
> > > > > write.PairwiseAlignedXStringSet() function that produces BLAST output
> > > > > to having a write.MultipleAlignment() function that produces BLAST
> > > > > output. For the specific case of BLAST output, would it make sense
> > > > > to support it for MultipleAlignment? Can someone point me to an example
> > > > > of such output? Or even better, to the specs of such format?
> > > > >  
> > > > > Note that right now there is the write.phylip() function in Biostrings
> > > > > for writing a MultipleAlignment object to a file but the Phylip format
> > > > > looks very different from the BLAST output:
> > > > >  
> > > > > hpages at latitude:~$ head -n 20 phylip_test.txt
> > > > > 9 2343
> > > > > Mask 0000000000 0000000000 0000000000 0000000000 0000000000
> > > > > Human -----TCCCG TCTCCGCAGC AAAAAAGTTT GAGTCGCCGC TGCCGGGTTG
> > > > > Chimp ---------- ---------- ---------- ---------- ----------
> > > > > Cow ---------- ---------- ---------- ---------- ----------
> > > > > Mouse ---------- ---------- --AAAAGTTG GAGTCTTCGC TTGAGAGTTG
> > > > > Rat ---------- ---------- ---------- ---------- ----------
> > > > > Dog ---------- ---------- ---------- ---------- ----------
> > > > > Chicken ---------- ----CGGCTC CGCAGCGCCT CACTCGCGCA GTCCCCGCGC
> > > > > Salmon GGGGGAGACT TCAGAAGTTG TTGTCCTCTC CGCTGATAAC AGTTGAGATG
> > > > >  
> > > > > 0000000000 0000000000 0000000000 0001111111 1111111111
> > > > > CCAGCGGAGT CGCGCGTCGG GAGCTACGTA GGGCAGAGAA GTCA-TGGCT
> > > > > ---------- ---------- ---------- ---------- ---A-TGGCT
> > > > > ---------- ---------- ---------- ---GAGAGAA GTCA-TGGCT
> > > > > CCAGCGGAGT CGCGCGCCGA CAGCTACGCG GCGCAGA-AA GTCA-TGGCT
> > > > > ---------- ---------- ---------- ---------- ---A-TGGCT
> > > > > ---------- ---------- ---------- ---------- ---A-TGGCT
> > > > > AGGGCCGGGC AGAGGCGCAC GCAGCTCCCC GGGCGGCCCC GCTC-CAGCC
> > > > > CGCATATTAT TATTACCTTT AGGACAAGTT GAATGTGTTC GTCAACATCT
> > > > >  
> > > > > Thanks!
> > > > > H.
> > > > >  
> > > > > >  
> > > > > > Thomas
> > > > > >  
> > > > > > On Tue, Apr 17, 2012 at 05:59:24PM +0000, Hervé Pagès wrote:
> > > > > > > Hi Martin,
> > > > > > >  
> > > > > > > On 04/16/2012 04:06 AM, Martin Preusse wrote:
> > > > > > > > Hi Charles,
> > > > > > > >  
> > > > > > > > thanks! Your solution allows to print the two alignment strings separately.
> > > > > > > >  
> > > > > > > > I was thinking of an output as generated by alignment tools:
> > > > > > > >  
> > > > > > > > AGT-TCTAT
> > > > > > > > | | | | | | | | |
> > > > > > > > AGTATCTAT
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > > This looks like BLAST output. Is this what you have in mind? Note that
> > > > > > > there are many alignment tools and many ways to output the result to a
> > > > > > > file. I'm not really familiar with the BLAST output format. Is it
> > > > > > > specified somewhere? Would that make sense to add something like a
> > > > > > > write.PairwiseAlignedXStringSet() function to Biostrings for writing
> > > > > > > the result of pairwiseAlignment() to a file? We could do this and
> > > > > > > support the BLAST format if that's a commonly used format.
> > > > > > >  
> > > > > > > Thanks,
> > > > > > > H.
> > > > > > >  
> > > > > > > >  
> > > > > > > > For this I would have to write a function to output the strings in blocks of e.g. 60 nucleotides, right?
> > > > > > > >  
> > > > > > > > Cheers
> > > > > > > > Martin
> > > > > > > >  
> > > > > > > >  
> > > > > > > >  
> > > > > > > > Am Freitag, 13. April 2012 um 19:21 schrieb Chu, Charles:
> > > > > > > >  
> > > > > > > > > write.XStringSet
> > > > > > > >  
> > > > > > > >  
> > > > > > > > _______________________________________________
> > > > > > > > Bioconductor mailing list
> > > > > > > > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
> > > > > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > > > > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > >  
> > > > > > > --
> > > > > > > Hervé Pagès
> > > > > > >  
> > > > > > > Program in Computational Biology
> > > > > > > Division of Public Health Sciences
> > > > > > > Fred Hutchinson Cancer Research Center
> > > > > > > 1100 Fairview Ave. N, M1-B514
> > > > > > > P.O. Box 19024
> > > > > > > Seattle, WA 98109-1024
> > > > > > >  
> > > > > > > E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
> > > > > > > Phone: (206) 667-5791
> > > > > > > Fax: (206) 667-1319
> > > > > > >  
> > > > > > > _______________________________________________
> > > > > > > Bioconductor mailing list
> > > > > > > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
> > > > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > > > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> > > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > > --
> > > > > Hervé Pagès
> > > > >  
> > > > > Program in Computational Biology
> > > > > Division of Public Health Sciences
> > > > > Fred Hutchinson Cancer Research Center
> > > > > 1100 Fairview Ave. N, M1-B514
> > > > > P.O. Box 19024
> > > > > Seattle, WA 98109-1024
> > > > >  
> > > > > E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
> > > > > Phone: (206) 667-5791
> > > > > Fax: (206) 667-1319
> > > >  
> > >  
> >  
>  
>  
>  
>  
> --  
> Hervé Pagès
>  
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>  
> E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
> Phone: (206) 667-5791
> Fax: (206) 667-1319



More information about the Bioconductor mailing list