[BioC] Group millions of the same DNA sequences?

Thu Nov 18 10:53:51 CET 2010

You could try CD-HIT (http://bioinformatics.ljcrf.edu/cd-hi/).
It is meant to work with huge numbers of sequences.

regards
KM

On Thu, Nov 18, 2010 at 2:53 PM, Stijn van Dongen <stijn at ebi.ac.uk> wrote:
>
> On Thu, Nov 18, 2010 at 09:21:07AM +0800, Xiaohui Wu wrote:
>> Thank you Aaron!  So far, sort and uniq may be the easiest way to do this.
>> For clustering, I don't think an assembler is suitable for my case. I want to
>> cluster similar reads into separate clusters, each containing some reads,
>> and then do further analysis.
>
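
For the exact-duplicate part, here is a minimal Python sketch of what
sort | uniq -c does. The function name and the toy reads are made up for
illustration, and it assumes the distinct sequences fit in memory, which
may not hold for every data set.

```python
from collections import Counter


def count_identical_reads(reads):
    """Count how often each distinct sequence occurs (like `sort | uniq -c`)."""
    # Strip whitespace and upper-case so trivially different lines still match;
    # empty lines are skipped.
    return Counter(read.strip().upper() for read in reads if read.strip())


# Toy input (hypothetical); in practice this would be one read per line of a file.
reads = ["ACGT", "acgt", "TTGA", "ACGT\n", "TTGA", "GGCC"]
counts = count_identical_reads(reads)

# Print the groups, most frequent first.
for seq, n in counts.most_common():
    print(n, seq)
```

Since a Counter accepts any iterable, an open file handle can be passed in
directly, so the reads themselves never need to be held in a list.
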
> about the clustering, an approach like
>
>   Fast approximate hierarchical clustering using similarity heuristics
>   Meelis Kull and Jaak Vilo
>
> could be worthwhile. If the similarities, expressed as distances, obey the
> triangle inequality, it should not be necessary to do all-against-all
> comparisons.
>
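
To illustrate why a triangle inequality saves comparisons, here is a rough
Python sketch. It is not the Kull and Vilo algorithm; it is a simple greedy
clustering with one pivot, and it assumes equal-length reads so that Hamming
distance (a true metric) can be used. The function names, the threshold and
the toy reads are all made up for the example.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(reads, threshold):
    """Put each read in the first cluster whose representative is within
    `threshold` mismatches; otherwise start a new cluster.

    Returns (clusters, n_compared), where n_compared counts the
    read-vs-representative comparisons that were actually carried out.
    """
    if not reads:
        return [], 0
    pivot = reads[0]          # arbitrary reference sequence
    reps = []                 # (representative, distance to pivot, members)
    n_compared = 0
    for read in reads:
        d_pivot = hamming(read, pivot)
        for rep, rep_d_pivot, members in reps:
            # Triangle inequality: d(read, rep) >= |d(read, pivot) - d(rep, pivot)|,
            # so if that lower bound exceeds the threshold the pair cannot match.
            if abs(d_pivot - rep_d_pivot) > threshold:
                continue
            n_compared += 1
            if hamming(read, rep) <= threshold:
                members.append(read)
                break
        else:
            # No representative was close enough: the read founds a new cluster.
            reps.append((read, d_pivot, [read]))
    return [members for _, _, members in reps], n_compared


toy_reads = ["AAAAAAAA", "AAAAAAAT", "TTTTTTTT", "TTTTTTTA", "AAAAAATT"]
clusters, compared = greedy_cluster(toy_reads, threshold=1)
print(clusters)
print("read-vs-representative comparisons:", compared)
```

Each read still costs one distance computation against the pivot, but most
read-vs-representative comparisons are skipped from the bound alone. With a
single pivot the pruning can be weak on real data; using several pivots is a
common refinement.
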
> best,
> Stijn
>
> --
> Stijn van Dongen         >8<        -o)   O<  forename pronunciation: [Stan]
> EMBL-EBI                            /\\   Tel: +44-(0)1223-492675
> Hinxton, Cambridge, CB10 1SD, UK   _\_/   http://micans.org/stijn
>