[BioC] Group millions of the same DNA sequences?

Xiaohui Wu wux3 at muohio.edu
Tue Nov 16 11:46:13 CET 2010


Hi all,

I have about 100 million (100M) DNA reads, each ~150 nt long, and some of them are duplicates. Is there a way to collapse identical sequences into one and count how many times each occurs? I am looking for something like the unique() function in R, but one that also reports the number of occurrences of each read and is more efficient.
Also, if I want to cluster these 100M reads by similarity, for example grouping reads whose edit distance (or some other distance) is <= 2, is there a function or package that can do this?
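
To make the two requests concrete, here is a toy sketch in Python (not R) of the kind of output I am after. It is only an illustration on five short made-up reads, not something expected to scale to 100M reads, and the names count_reads and within_edit_distance are hypothetical helpers, not functions from any package.

```python
from collections import Counter


def count_reads(reads):
    """Collapse identical reads; return (sequence, count) pairs, most frequent first."""
    return Counter(reads).most_common()


def within_edit_distance(a, b, max_dist=2):
    """Return True if the Levenshtein (edit) distance between a and b is <= max_dist."""
    # A length difference larger than max_dist already rules the pair out.
    if abs(len(a) - len(b)) > max_dist:
        return False
    # Standard dynamic-programming recurrence, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        # Row minima never decrease, so we can stop early.
        if min(cur) > max_dist:
            return False
        prev = cur
    return prev[-1] <= max_dist


# Toy data (made up for illustration only).
reads = ["ACGT", "ACGT", "ACGA", "TTTT", "ACGT"]
counts = count_reads(reads)
similar = within_edit_distance("ACGT", "ACGA")
different = within_edit_distance("ACGT", "TTTT")
```

Here counts holds each distinct read once together with its number of occurrences, and the distance check is the pairwise test that a clustering with threshold 2 would need. Comparing all pairs this way is quadratic in the number of reads, which is why I am asking for something more efficient.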
Thank you!

Xiaohui


