[BioC] Group millions of the same DNA sequences?

Tue Nov 16 23:04:30 CET 2010

Dear Xiaohui,
It might be better to use CAP3 (Contig Assembly Program), which is also freeware, rather than R (or anything else), unless you have a lot of SSRs. If you have a lot of SSRs, nothing will help.
Cheers
Bob
________________________________________
From: bioconductor-bounces at stat.math.ethz.ch [bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Wei Shi [shi at wehi.edu.au]
Sent: Tuesday, November 16, 2010 4:54 PM
To: Xiaohui Wu
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Group millions of the same DNA sequences?

Dear Xiaohui:

        The Unix command sort (which is also available on Mac) can group your sequences, but I guess you will have to write some code to count the occurrences of each distinct sequence. I have written some C code to do a similar thing and I will be happy to share it, though I am not sure whether your data format is the same as mine.
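As a rough illustration of the counting step, here is a minimal Python sketch (not the C code mentioned above). It assumes one read per line and treats upper- and lower-case bases as identical; the function name count_reads and the toy input are made up for the example. On sorted input, the standard uniq -c command also reports a count per distinct line, which may be enough on its own.

```python
from collections import Counter
from typing import Iterable


def count_reads(reads: Iterable[str]) -> Counter:
    """Count how many times each distinct read occurs.

    Assumption: each item is one read; case is normalized to upper case.
    """
    counts = Counter()
    for read in reads:
        seq = read.strip().upper()
        if seq:  # skip blank lines
            counts[seq] += 1
    return counts


if __name__ == "__main__":
    # Toy input standing in for a file with one read per line (assumed format).
    example = ["ACGT", "ACGT", "TTGA", "acgt", "TTGA", "GGCC"]
    for seq, n in count_reads(example).most_common():
        print(seq, n)
```

This keeps every distinct read in memory, so with around 100M reads of ~150 nt it may need many gigabytes; if it does not fit, sorting on disk first and counting runs of identical lines is the safer route.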

        I am not aware of any R functions that can cluster strings. I imagine such functions, if they exist, would be extremely slow.
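To make the "distance <= 2" criterion from the original question concrete, the sketch below tests whether two reads are within a given edit (Levenshtein) distance, filling in only the cells near the diagonal of the dynamic-programming table. It is written in Python rather than R, and the function name within_edit_distance and its parameter k are assumptions for illustration. It covers only the pairwise test, not clustering: comparing all pairs of 100M reads would mean roughly 5 x 10^15 comparisons, so some filtering or indexing step would be needed first.

```python
def within_edit_distance(a: str, b: str, k: int = 2) -> bool:
    """Return True if the Levenshtein distance between a and b is <= k."""
    if abs(len(a) - len(b)) > k:
        return False  # the length difference alone already exceeds k
    big = k + 1  # any value above k just means "too far"
    # Row 0: distance from the empty prefix of a to each prefix of b.
    prev = [j if j <= k else big for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [big] * (len(b) + 1)
        if i <= k:
            cur[0] = i
        # Only cells within k of the diagonal can hold a value <= k.
        lo = max(1, i - k)
        hi = min(len(b), i + k)
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # deletion from a
                         cur[j - 1] + 1,      # insertion into a
                         big)                 # cap at k + 1
        if min(cur) > k:
            return False  # no alignment can get back under k
        prev = cur
    return prev[len(b)] <= k


if __name__ == "__main__":
    print(within_edit_distance("ACGTACGT", "ACGAACGT"))  # one substitution
    print(within_edit_distance("AAAA", "TTTT"))          # four substitutions
```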

        Hope this helps.

Cheers,
Wei

On Nov 16, 2010, at 9:46 PM, Xiaohui Wu wrote:

> Hi all,
>
> I have millions (around 100M) of DNA reads, each ~150 nt long, and some of them are duplicates. Is there any way to group identical sequences into one and count them, like the unique() function in R, but reporting the number of occurrences of each read and running more efficiently?
> Also, if I want to cluster these 100M reads based on their similarity, e.g. edit distance or some other distance <= 2, is there a function or package that can be used?
> Thank you!
>
> Xiaohui
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
