[BioC] Group millions of the same DNA sequences?
km
srikrishnamohan at gmail.com
Thu Nov 18 10:53:51 CET 2010
You could try CD-HIT (http://bioinformatics.ljcrf.edu/cd-hi/).
It is meant to work with huge numbers of sequences.
regards
KM
On Thu, Nov 18, 2010 at 2:53 PM, Stijn van Dongen <stijn at ebi.ac.uk> wrote:
>
> On Thu, Nov 18, 2010 at 09:21:07AM +0800, Xiaohui Wu wrote:
>> Thank you Aaron! So far, sort and uniq may be the easiest way to do this.
>> For clustering, I don't think an assembler is suitable for my case. I want to
>> cluster similar reads into separate clusters, each containing some reads,
>> and then do further analysis.
>
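For reference, the "sort and uniq" grouping of identical reads mentioned above can also be sketched in Python with a dictionary keyed on the sequence. This is only a minimal illustration: the function name group_identical, the (read id, sequence) input format and the upper-casing of sequences are assumptions, not something stated in the thread.

```python
from collections import defaultdict

def group_identical(reads):
    """Map each distinct sequence to the list of read ids that carry it."""
    groups = defaultdict(list)
    for read_id, seq in reads:
        # Assumption: case differences are not meaningful, so normalise them.
        groups[seq.upper()].append(read_id)
    return dict(groups)

# Toy input; real reads would be parsed from a FASTA/FASTQ file.
reads = [("r1", "ACGT"), ("r2", "TTGA"), ("r3", "acgt"), ("r4", "ACGT")]
groups = group_identical(reads)

# Print count, sequence and member ids, largest group first.
for seq, ids in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    print(len(ids), seq, ",".join(ids))
```

With many millions of reads the dictionary has to fit in memory, so the command-line sort and uniq route may still be the more practical choice.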
> about the clustering, an approach like
>
> Fast approximate hierarchical clustering using similarity heuristics
> Meelis Kull and Jaak Vilo
>
> could be worthwhile. If the similarities obey the triangle inequality
> (that is, they behave like a metric), it should not be necessary to do
> all-against-all comparisons.
>
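To illustrate why a metric lets you skip comparisons: for any reference sequence p, the triangle inequality gives d(x, y) >= |d(x, p) - d(y, p)|, so a pair whose distances to p differ by more than the threshold cannot be similar. The sketch below is not the Kull and Vilo algorithm; it is a minimal pivot filter that assumes equal-length reads, uses Hamming distance (a distance rather than a similarity), and uses hypothetical names (close_pairs, pivot_index).

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("equal lengths required")
    return sum(x != y for x, y in zip(a, b))

def close_pairs(seqs, threshold, pivot_index=0):
    """Return (index pairs within threshold, number of full comparisons made)."""
    pivot = seqs[pivot_index]
    # One pass: distance of every sequence to the single pivot.
    to_pivot = [hamming(s, pivot) for s in seqs]
    pairs, compared = [], 0
    for i, j in combinations(range(len(seqs)), 2):
        # Triangle inequality: d(i, j) >= |d(i, p) - d(j, p)|.
        if abs(to_pivot[i] - to_pivot[j]) > threshold:
            continue  # cannot be within threshold; skip the real comparison
        compared += 1
        if hamming(seqs[i], seqs[j]) <= threshold:
            pairs.append((i, j))
    return pairs, compared

seqs = ["AAAAAAAA", "AAAAAAAT", "TTTTTTTT", "TTTTTTTA", "AAAATTTT"]
pairs, compared = close_pairs(seqs, threshold=1)
print(pairs, compared)
```

On this toy input only 2 of the 10 possible pairs need a full comparison. The loop over candidate pairs is still quadratic here; it only saves the expensive distance computations, and a real implementation would use several pivots or an index structure.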
> best,
> Stijn
>
> --
> Stijn van Dongen >8< -o) O< forename pronunciation: [Stan]
> EMBL-EBI /\\ Tel: +44-(0)1223-492675
> Hinxton, Cambridge, CB10 1SD, UK _\_/ http://micans.org/stijn
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>