[BioC] Group millions of the same DNA sequences?

Wed Nov 17 16:29:50 CET 2010

On Nov 17, 2010, at 9:58 AM, Aaron Mackey wrote:
> sort -u | uniq -c will do the counting for you.

Actually,

	sort reads | uniq -c
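
To spell that out: the -u flag makes sort drop duplicate lines, so uniq -c would then report a count of 1 for every read. Below is a minimal sketch of the pipeline, assuming plain text with one read per line (FASTA or FASTQ input would first need its sequence lines extracted). The function name count_reads and the sample reads are made up for illustration. LC_ALL=C and the final sort by count are additions here, not part of the command above.

```shell
# Hypothetical helper (not from the thread): count identical reads.
# Reads the files given as arguments, or stdin if there are none.
count_reads() {
    # Plain sort (no -u) keeps the duplicates and puts them on adjacent
    # lines; uniq -c then prefixes each distinct read with its count.
    # LC_ALL=C makes sort compare raw bytes, which is usually faster.
    # The last sort orders the result by count, most frequent first.
    LC_ALL=C sort "$@" | uniq -c | sort -k1,1nr
}

# Made-up sample input: ACGT appears 3 times, TTGA twice, GGCC once.
printf 'ACGT\nTTGA\nACGT\nACGT\nTTGA\nGGCC\n' | count_reads
```

Each output line is a count followed by the read. For an input of 100M reads, sort spills to temporary files on disk, so it may be worth checking that the temporary directory has enough free space.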

> what is it you're trying to accomplish with these reads?...
> -Aaron


> 2010/11/16 Xiaohui Wu <wux3 at muohio.edu>
>
>> Hi Wei,
>>
>> Thank you for your reply! I would very much appreciate it if you could
>> send me your C code for reference.
>>
>> Xiaohui
>>
>> -------------------------------------------------------------
>> From: Wei Shi
>> Sent: 2010-11-17 05:55:04
>> To: Wu, Xiaohui Ms.
>> Cc: bioconductor at stat.math.ethz.ch
>> Subject: Re: [BioC] Group millions of the same DNA sequences?
>>
>> Dear Xiaohui:
>>
>>        The Unix command sort (which is also available on Mac) can group
>> your sequences, but I guess you will have to write some code to count the
>> occurrences of each distinct sequence. I have written some C code that
>> does something similar, and I will be happy to share it, but I am not
>> sure whether your data format is the same as mine.
>>
>>        I am not aware of any R functions that can cluster strings. I
>> imagine such functions, if they exist, would be extremely slow.
>>
>>        Hope this helps.
>>
>> Cheers,
>> Wei
>>
>> On Nov 16, 2010, at 9:46 PM, Xiaohui Wu wrote:
>>
>>> Hi all,
>>>
>>> I have about 100 million DNA reads, each ~150 nt long, and some of them
>>> are duplicates. Is there a way to collapse identical sequences into one
>>> and count how often each occurs, like the unique() function in R, but
>>> reporting the number of occurrences per read and running more efficiently?
>>> Also, if I want to cluster these 100M reads by similarity, for example
>>> edit distance (or some other distance) <= 2, is there a function or
>>> package that can be used?
>>> Thank you!
>>>
>>> Xiaohui
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>