[BioC] Group millions of the same DNA sequences?

Thu Nov 18 02:21:07 CET 2010

Thank you, Aaron! So far, sort and uniq seem to be the easiest way to do this. As for clustering, I don't think an assembler is suitable for my case. I want to group similar reads into clusters, each containing a number of reads, and then do further analysis on those clusters.
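To make the kind of clustering described here concrete, below is a rough Python sketch of one possible approach: greedy, representative-based clustering on the distinct reads. It is only an illustration under assumptions. The function names (`within_distance`, `greedy_cluster`), the plain Levenshtein distance, the threshold of 2 and the toy counts are all made up for the example and do not come from any package mentioned in this thread. Each read is compared against every cluster representative, so it would need some indexing or pre-filtering before it could be tried on anything close to 100M reads.

```python
def within_distance(a, b, max_dist):
    """Return True if the Levenshtein distance between a and b is <= max_dist."""
    if abs(len(a) - len(b)) > max_dist:
        return False  # the length difference alone already exceeds the threshold
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        if min(cur) > max_dist:
            return False  # row minima never decrease, so we can stop early
        prev = cur
    return prev[-1] <= max_dist


def greedy_cluster(counts, max_dist=2):
    """Cluster distinct reads greedily, most abundant first.

    counts maps each distinct read to its number of occurrences.
    Each returned cluster is a list whose first element is its representative.
    """
    clusters = []
    for read in sorted(counts, key=lambda r: (-counts[r], r)):
        for cluster in clusters:
            if within_distance(read, cluster[0], max_dist):
                cluster.append(read)
                break
        else:
            clusters.append([read])  # no representative is close enough
    return clusters


# Toy counts standing in for the output of a deduplication step.
toy_counts = {"ACGTACGT": 5, "ACGTACGA": 1, "ACGTCGT": 1,
              "TTTTGGGG": 3, "TTTTGGGC": 1}
for cluster in greedy_cluster(toy_counts):
    print(cluster)
```

Working on the distinct reads with their counts, rather than on the raw reads, keeps the number of comparisons down when many reads are exact duplicates.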

Xiaohui

-------------------------------------------------------------
From: Aaron Mackey
Sent: 2010-11-17 22:58:57
To: Wu, Xiaohui Ms.
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Group millions of the same DNA sequences?

sort | uniq -c will do the counting for you.
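For anyone who would rather do the counting inside a script, a minimal Python sketch could look like the following. It assumes one read per line and that all distinct reads fit in memory, which may not hold for 100M reads. `count_reads` and the toy input are illustrative names only, not part of any existing tool.

```python
from collections import Counter


def count_reads(lines):
    """Count how often each distinct read occurs (one read per line)."""
    counts = Counter()
    for line in lines:
        read = line.strip()
        if read:  # skip blank lines
            counts[read] += 1
    return counts


# Toy input standing in for a file with one read per line.
reads = ["ACGT\n", "TTGA\n", "ACGT\n", "ACGT\n", "TTGA\n", "GGCC\n"]
counts = count_reads(reads)
for read, n in counts.most_common():
    print(n, read)
```

With a real file, the list could be replaced by an open file handle, since the function only iterates over its input once.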

But as these are sequence data, you should probably be concerned about indels (gaps) as well as mismatches, so the "edit distance" is not a great clustering metric. You could, of course, calculate all-vs-all Needleman-Wunsch pairwise alignments, count the edit operations in each alignment (counting any indel as one operation, regardless of its length), and cluster by some threshold on that count, but that would be computationally expensive.
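As a rough illustration of the scoring described above, here is a Python sketch of a global alignment that returns only the number of edit operations, with a mismatch costing 1 and a gap costing 1 however long it is. It uses the standard three-state (affine-gap) recurrence rather than any particular library. The function name `edit_operations` and the `gap_open` / `gap_extend` parameters are assumptions made for the example.

```python
INF = float("inf")


def edit_operations(a, b, mismatch=1, gap_open=1, gap_extend=0):
    """Minimum cost of a global alignment of a and b.

    With the defaults, a mismatch costs 1 and an indel of any length costs 1.
    """
    m = len(b)
    # Row 0: the empty prefix of a against each prefix of b.
    prev_m = [0] + [INF] * m    # alignment ends in a match/mismatch column
    prev_x = [INF] * (m + 1)    # ends in a gap in b (bases of a unpaired)
    prev_y = [INF] + [gap_open + (j - 1) * gap_extend for j in range(1, m + 1)]
    for i in range(1, len(a) + 1):
        cur_m = [INF] * (m + 1)
        cur_x = [gap_open + (i - 1) * gap_extend] + [INF] * m
        cur_y = [INF] * (m + 1)  # ends in a gap in a (bases of b unpaired)
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            cur_m[j] = min(prev_m[j - 1], prev_x[j - 1], prev_y[j - 1]) + sub
            # Extending an open gap costs gap_extend; opening one costs gap_open.
            cur_x[j] = min(prev_x[j] + gap_extend,
                           prev_m[j] + gap_open,
                           prev_y[j] + gap_open)
            cur_y[j] = min(cur_y[j - 1] + gap_extend,
                           cur_m[j - 1] + gap_open,
                           cur_x[j - 1] + gap_open)
        prev_m, prev_x, prev_y = cur_m, cur_x, cur_y
    return min(prev_m[m], prev_x[m], prev_y[m])


print(edit_operations("AACCGGTT", "AAGGTT"))    # one 2-base deletion
print(edit_operations("ACGTACGT", "ACGTACGA"))  # one mismatch
```

One thing to keep in mind with a length-independent gap cost: any two reads can be aligned with at most two operations (one gap spanning each read), so a threshold of 2 would put everything in one cluster. Setting `gap_extend` above 0, or capping the gap length, avoids that.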

What is it you're trying to accomplish with these reads? Assembly? If so, just give your reads to an assembler (MIRA, Velvet, etc.) and let the assembler do this clustering for you.

-Aaron

2010/11/16 Xiaohui Wu <wux3 at muohio.edu>
Hi Wei,

Thank you for your reply! I would really appreciate it if you could send me your C code for reference.

Xiaohui

-------------------------------------------------------------
From: Wei Shi
Sent: 2010-11-17 05:55:04
To: Wu, Xiaohui Ms.
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Group millions of the same DNA sequences?

Dear Xiaohui:

       The Unix command sort (which is also available on Mac) can group your sequences. But I guess you will have to write some code to count the number of occurrences of each distinct sequence. I have written some C code that does a similar thing and I will be happy to share it, but I am not sure if your data format is the same as mine.
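This is not the C code mentioned above, only a small Python sketch of the counting step it describes. It assumes the input has already been sorted, with one sequence per line, so that identical sequences sit next to each other. `count_sorted_runs` is a made-up name for the example.

```python
from itertools import groupby


def count_sorted_runs(sorted_lines):
    """Yield (sequence, count) for each run of identical adjacent lines."""
    stripped = (line.rstrip("\n") for line in sorted_lines)
    for seq, run in groupby(stripped):
        yield seq, sum(1 for _ in run)


# Toy input, in the order that `sort` would leave it.
sorted_reads = ["ACGT\n", "ACGT\n", "ACGT\n", "GGCC\n", "TTGA\n", "TTGA\n"]
result = list(count_sorted_runs(sorted_reads))
for seq, n in result:
    print(n, seq)
```

Because only the current run is held at any time, memory use stays small no matter how many lines are read. The expensive part is then the sort itself, which the sort command can do on disk.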

       I am not aware of any R functions that can cluster strings. I imagine such functions, if they exist, would be extremely slow on this much data.

       Hope this helps.

Cheers,
Wei

On Nov 16, 2010, at 9:46 PM, Xiaohui Wu wrote:

> Hi all,
>
> I have millions of DNA reads (about 100M), each ~150 nt long, and some of them are duplicates. Is there any way to group identical sequences into one and count them, like the unique() function in R but also giving the number of occurrences of each read, and more efficiently?
> Also, if I want to cluster these 100M reads based on their similarity, e.g. edit distance or some other distance <= 2, is there a function or package that can be used?
> Thank you!
>
> Xiaohui
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
