[BioC] Group millions of the same DNA sequences?
m_olshansky at yahoo.com
Tue Nov 16 23:32:28 CET 2010
You can use the aggregate function: if X is a character vector of your sequences, so that X[i] is your i-th sequence (i = 1, 2, ..., 100M), then do
y <- aggregate(X,list(X),length)
Then y is a two-column data.frame, with column 1 containing the sequence and column 2 the count.
If this is too slow in R, use sort -u on Unix (as Wei suggested) to get the unique, sorted sequences, and then write a simple C program that runs over all your reads and adds 1 to count[i], where i is that read's index in the unique sorted list (a binary search will find that i).
--- On Tue, 16/11/10, Xiaohui Wu <wux3 at muohio.edu> wrote:
> From: Xiaohui Wu <wux3 at muohio.edu>
> Subject: [BioC] Group millions of the same DNA sequences?
> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
> Received: Tuesday, 16 November, 2010, 9:46 PM
> Hi all,
> I have millions of DNA reads (around 100M), each ~150 nt,
> and some of them are duplicates. Is there any way to
> group identical sequences together and count them, like
> the unique() function in R but reporting each read's
> occurrence count, and also more efficient?
> Also, if I want to cluster these 100M reads based on
> their similarity, e.g. edit distance <= 2, is there
> some function or package that can be used?
> Thank you!
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor