[BioC] Group millions of the same DNA sequences?

Thu Nov 18 10:23:53 CET 2010

On Thu, Nov 18, 2010 at 09:21:07AM +0800, Xiaohui Wu wrote:
> Thank you Aaron!  Till now, sort and uniq may be the easiest way to do this.
> For clustering, I don't think assembler is suitable for my case. I want to
> cluster similar reads to get different clusters, each cluster has some reads,
> and do further analysis.

As for the clustering, an approach like the one described in

   Fast approximate hierarchical clustering using similarity heuristics
   Meelis Kull and Jaak Vilo 

could be worthwhile. If the similarities correspond to distances that obey
the triangle inequality (i.e. they form a metric), it should not be
necessary to do all-against-all comparisons.
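
To illustrate why a metric lets you skip comparisons, here is a minimal
Python sketch. It is not the algorithm from the paper above, just the basic
pruning idea, and it assumes equal-length reads compared by Hamming distance
(which is a metric). The function names and the single-pivot scheme are
invented for this example. Distances from every read to one pivot read are
computed once; for a query q, any read x with |d(q,pivot) - d(x,pivot)|
larger than the radius cannot be within the radius of q, so the full
comparison is skipped.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings (a true metric)."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    return sum(x != y for x, y in zip(a, b))


def neighbours_within(query, reads, pivot, pivot_dists, radius):
    """Return (hits, computed): the reads within `radius` of `query`, and
    how many full query-vs-read distances had to be computed."""
    dq = hamming(query, pivot)
    hits, computed = [], 0
    for read, dp in zip(reads, pivot_dists):
        # Triangle inequality: d(query, read) >= |d(query, pivot) - d(read, pivot)|.
        # If that lower bound already exceeds the radius, skip the comparison.
        if abs(dq - dp) > radius:
            continue
        computed += 1
        if hamming(query, read) <= radius:
            hits.append(read)
    return hits, computed


# Toy data (made up for illustration).
reads = ["AAAAAAAA", "AAAAAAAT", "AAAATTTT", "TTTTTTTT", "TTTTTTTA"]
pivot = reads[0]
pivot_dists = [hamming(r, pivot) for r in reads]  # computed once, reused per query

hits, computed = neighbours_within("AAAAAATT", reads, pivot, pivot_dists, radius=1)
print(hits, computed)  # ['AAAAAAAT'] 1
```

In this toy case only one of the five reads needs a full comparison. With
real data one would probably use several pivots, and an edit distance if
the reads differ in length; both are left out to keep the sketch short.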

best,
Stijn

-- 
Stijn van Dongen         >8<        -o)   O<  forename pronunciation: [Stan]
EMBL-EBI                            /\\   Tel: +44-(0)1223-492675
Hinxton, Cambridge, CB10 1SD, UK   _\_/   http://micans.org/stijn