[R] sequence clustering and assembly

Thu Apr 15 14:33:11 CEST 2010

Hi Bogdan --

On 04/14/2010 08:19 PM, Bogdan Tanasa wrote:
> Dear all,
> 
> please could you suggest any R functions or packages (or external
> programs), that

You'll likely have more luck on the Bioconductor mailing list,

http://bioconductor.org/docs/mailList.html

but...

> a. take as input a large number (> 10 000) of short 20-30 nt
> sequences, and do sequence assembly, to reconstruct larger (extended)
> 30-50 sequences ?

I don't know of any sequence assemblers in R; Velvet would be a
first-stop third-party tool, but it sounds like you have some fairly
specific requirements...

> b. take as input a larger number of sequences (100 000 - 1 mil) and
> cluster these sequences in distinct classes based on the sequence
> similarity  ?

The Biostrings package has various functions to calculate edit distance,
which might form the input to familiar R clustering algorithms. See
installation instructions at

http://www.bioconductor.org/packages/release/bioc/html/Biostrings.html

This thread

https://stat.ethz.ch/pipermail/bioconductor/2010-March/032580.html

might suggest some directions.
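
As a rough illustration of the edit-distance-plus-clustering idea, here is
a minimal sketch on made-up toy reads. It uses base R's adist() so that it
runs without extra packages; with Biostrings installed, stringDist() on a
DNAStringSet would be the analogous call. The reads, the average-linkage
choice and the cut height of 5 are all assumptions made for the example,
not recommendations. A full pairwise distance matrix grows quadratically,
so this only suits a small sample, not 100 000 - 1 mil sequences at once.

```r
# Toy reads (hypothetical data): two families, each with one-base variants.
reads <- c("ACACACACACACACACACAC",
           "ACACACACACACACACACAA",
           "ACACACACAAACACACACAC",
           "GTGTGTGTGTGTGTGTGTGT",
           "GTGTGTGTGTGTGTGTGTGG",
           "GTGTGTGTGGGTGTGTGTGT")

# Pairwise Levenshtein (edit) distances; adist() lives in base R's utils.
# as.dist() converts the square matrix to the form hclust() expects.
d <- as.dist(adist(reads))

# Average-linkage hierarchical clustering on the edit distances.
hc <- hclust(d, method = "average")

# Cut the tree at an edit distance of 5 (illustrative threshold only).
clusters <- cutree(hc, h = 5)

# Show the reads grouped by cluster label.
print(split(reads, clusters))
```

For larger inputs one would probably cluster a subsample first, or
collapse identical reads before computing distances.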

Martin

> 
> thanks a lot,
> 
> bogdan
> 

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793