[BioC] Group millions of the same DNA sequences?
Harris A. Jaffee
hj at jhu.edu
Wed Nov 17 16:29:50 CET 2010
On Nov 17, 2010, at 9:58 AM, Aaron Mackey wrote:
> sort -u | uniq -c will do the counting for you.
Actually, drop the -u, which would discard the duplicates before uniq can
count them:
    sort reads | uniq -c
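To make the difference concrete, here is a minimal sketch. It assumes one read per line in a plain text file; the helper name count_reads and the toy sequences are made up for illustration, and LC_ALL=C is only a suggestion to get a plain byte-wise sort.

```shell
# Hypothetical helper: print "count sequence" for each distinct line of a file.
count_reads() {
    # sort puts identical reads next to each other; uniq -c then prefixes
    # each distinct read with its count; awk strips uniq's padding.
    LC_ALL=C sort "$1" | uniq -c | awk '{ print $1, $2 }'
}

# Toy data, one read per line (made up for illustration).
tmp=$(mktemp)
printf '%s\n' ACGT TTGA ACGT ACGT TTGA GGCC > "$tmp"

# Counts per distinct read: 3 ACGT, 1 GGCC, 2 TTGA.
count_reads "$tmp"

# With sort -u the duplicates are gone before uniq runs,
# so every count comes out as 1.
LC_ALL=C sort -u "$tmp" | uniq -c | awk '{ print $1, $2 }'

rm -f "$tmp"
```

On a real file of ~100M reads the same pipeline should work, since sort spills to temporary files when the input does not fit in memory, but expect it to take a while.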
> what is it you're trying to accomplish with these reads?...
> -Aaron
> 2010/11/16 Xiaohui Wu <wux3 at muohio.edu>
>
>> Hi Wei,
>>
>> Thank you for your reply! I would very much appreciate it if you could
>> send me your C code for reference.
>>
>> Xiaohui
>>
>> -------------------------------------------------------------
>> From: Wei Shi
>> Sent: 2010-11-17 05:55:04
>> To: Wu, Xiaohui Ms.
>> Cc: bioconductor at stat.math.ethz.ch
>> Subject: Re: [BioC] Group millions of the same DNA sequences?
>>
>> Dear Xiaohui:
>>
>> The unix command sort (which is also available on Mac) can group
>> your sequences. But I guess you will have to write some code to count
>> the occurrences of each distinct sequence. I have written some C code
>> to do a similar thing and I will be happy to share it. But I am not
>> sure if your data format is the same as mine.
>>
>> I am not aware of any R functions which can cluster strings. I imagine
>> it will be extremely slow if there are such functions.
>>
>> Hope this helps.
>>
>> Cheers,
>> Wei
>>
>> On Nov 16, 2010, at 9:46 PM, Xiaohui Wu wrote:
>>
>>> Hi all,
>>>
>>> I have millions (about 100M) of DNA reads, each ~150 nt long, and some
>>> of them are duplicates. Is there any way to group identical sequences
>>> into one and count them, like the unique() function in R, but also
>>> reporting the number of occurrences of each read and more efficiently?
>>> Also, if I want to cluster these 100M reads based on their similarity,
>>> e.g. edit distance or some other distance <= 2, is there some function
>>> or package that can be used?
>>> Thank you!
>>>
>>> Xiaohui
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>