[BioC] Group millions of the same DNA sequences?
Xiaohui Wu
wux3 at muohio.edu
Tue Nov 16 11:46:13 CET 2010
Hi all,
I have about 100 million DNA reads, each ~150 nt long, and some of them are duplicates. Is there a way to collapse identical sequences into one and count how many times each occurs, like the unique() function in R but also reporting the number of occurrences of each read, and more efficiently?
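To make the desired output concrete, here is a minimal sketch of the grouping on a toy input. It is written in Python only for illustration; the function name count_reads and the sample reads are made up, and nothing here is claimed to scale to 100M reads as written.

```python
from collections import Counter

def count_reads(reads):
    """Collapse identical reads and count how often each one occurs."""
    return Counter(reads)

# Made-up toy reads, for illustration only.
reads = ["ACGT", "TTGA", "ACGT", "ACGT", "TTGA", "GGCC"]
counts = count_reads(reads)

# Print each distinct read with its number of occurrences, most frequent first.
for seq, n in counts.most_common():
    print(seq, n)
```

The result is one entry per distinct sequence together with its count, which is the information unique() alone does not give.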
Also, if I want to cluster these 100M reads by similarity, e.g. edit distance or some other distance <= 2, is there a function or package that can be used?
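To pin down the similarity criterion, here is a rough sketch of an "edit distance <= 2" test with a naive greedy grouping on top of it. It is in Python only for illustration; the names within_edit_distance and greedy_cluster and the sample sequences are hypothetical, and taking the first read of each cluster as its representative is an assumption. The pairwise comparison is quadratic in the number of distinct reads, so this only shows the intent and is not a workable approach for 100M reads.

```python
def within_edit_distance(a, b, k=2):
    """Return True if the Levenshtein distance between a and b is <= k."""
    if abs(len(a) - len(b)) > k:
        return False  # the length difference alone already exceeds k
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        if min(cur) > k:
            return False  # row minima never decrease, so the distance exceeds k
        prev = cur
    return prev[-1] <= k

def greedy_cluster(seqs, k=2):
    """Assign each sequence to the first cluster whose representative is within k edits."""
    clusters = []  # list of (representative, members)
    for s in seqs:
        for rep, members in clusters:
            if within_edit_distance(s, rep, k):
                members.append(s)
                break
        else:
            clusters.append((s, [s]))  # no close representative: start a new cluster
    return clusters

# Made-up toy sequences, for illustration only.
seqs = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGAACGT"]
clusters = greedy_cluster(seqs, k=2)
```

Note that a greedy grouping like this depends on the input order, and two members of the same cluster can be more than 2 edits apart from each other, since each is only compared with the representative.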
Thank you!
Xiaohui