Circular genome clustering

Tathagata Debnath and Joe Song

Updated: 2021-07-27; 2020-11-29; 2020-09-05. Created: 2020-08-07

Optimal versus heuristic cluster borders on CpG sites of a circular bacterial genome

The fast optimal circular clustering (FOCC) algorithm (Debnath and Song 2021) and the heuristic repeated \(K\)-means circular clustering (HEUC) algorithm were applied to the CpG sites of the Candidatus Carsonella ruddii genome (GenBank accession number CP019943.1). Each algorithm grouped the CpG sites into 14 clusters, as shown in the figure below.
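As a rough illustration of the input to both algorithms, the following Python sketch lists the positions of CpG sites (a C immediately followed by a G) in a circular sequence. The function name `cpg_positions`, the 1-based coordinate convention, and the choice to count a CpG that spans the junction between the last and first base are assumptions made for this sketch; the text above does not state how the sites were extracted.

```python
def cpg_positions(sequence, circular=True):
    """Return 1-based positions of the C in every CpG dinucleotide.

    Hypothetical helper for illustration only. If `circular` is True,
    a C at the last base followed by a G at the first base is also
    counted, because the two ends of a circular genome are adjacent.
    """
    seq = sequence.upper()
    n = len(seq)
    # Scan every adjacent pair of bases in the linear representation.
    positions = [i + 1 for i in range(n - 1) if seq[i:i + 2] == "CG"]
    # Assumption: treat the end-to-start junction as one more adjacent pair.
    if circular and n > 1 and seq[-1] == "C" and seq[0] == "G":
        positions.append(n)
    return positions


toy = "GACGTTCGAC"
print(cpg_positions(toy))                  # [3, 7, 10]
print(cpg_positions(toy, circular=False))  # [3, 7]
```

The resulting positions, together with the genome length as the circumference, are the kind of circular data that a circular clustering method would take as input.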

The clusters obtained by the FOCC algorithm are more compact and easier to justify than those obtained by HEUC. The border between clusters C8 and C9 in the optimal clustering is subjectively more justifiable than the border between clusters C4 and C8 in the heuristic clustering. Orange arrows inside the circular genome point to these cluster borders. A fixed seed for random number generation is used so that \(K\)-means always returns the same result.

This example, drawn from a practical application, illustrates the advantage of optimal circular clustering over the heuristic clustering algorithm.
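To make the idea of a seeded, repeated \(K\)-means heuristic on circular data more concrete, here is a minimal Python sketch. It is one plausible reading of "repeated \(K\)-means", not the HEUC implementation itself: the circle is cut at several evenly spaced points, plain Lloyd-style \(K\)-means is run on each linearized copy, and the result with the smallest within-cluster sum of squares is kept. All names (`linearize`, `kmeans_1d`, `heuristic_circular_clustering`) and the parameters `n_cuts` and `seed` are hypothetical.

```python
import random


def linearize(points, circumference, cut):
    """Cut the circle at `cut` and return sorted linear coordinates."""
    return sorted((p - cut) % circumference for p in points)


def kmeans_1d(xs, k, rng, n_iter=100):
    """Lloyd-style K-means on 1-D data; returns (labels, sum of squares)."""
    centers = sorted(rng.sample(xs, k))  # random initial centers from the data
    labels = None
    for _ in range(n_iter):
        # Assign every point to its nearest center (ties go to the lower index).
        new_labels = [min(range(k), key=lambda j: (x - centers[j]) ** 2)
                      for x in xs]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Move each non-empty cluster's center to the mean of its members.
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    ssq = sum((x - centers[lab]) ** 2 for x, lab in zip(xs, labels))
    return labels, ssq


def heuristic_circular_clustering(points, circumference, k, n_cuts=8, seed=1):
    """Try several cut points; keep the K-means result with the lowest ssq."""
    rng = random.Random(seed)  # fixed seed makes the result reproducible
    best = None
    for i in range(n_cuts):
        cut = i * circumference / n_cuts
        xs = linearize(points, circumference, cut)
        labels, ssq = kmeans_1d(xs, k, rng)
        if best is None or ssq < best[0]:
            best = (ssq, cut, xs, labels)
    return best


# Toy circle of circumference 100 with one cluster wrapping around the origin.
toy_points = [98, 99, 0, 1, 49, 50, 51]
ssq, cut, xs, labels = heuristic_circular_clustering(toy_points, 100, k=2)
print(ssq, cut, labels)
```

On this toy input, cutting at position 0 splits the cluster that wraps around the origin, so a cut elsewhere yields a much smaller sum of squares. Because such a heuristic depends on which cut points are tried and on the random initial centers, it carries no guarantee of finding the optimal borders, which is the gap an optimal algorithm such as FOCC is designed to close.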

References

Debnath, Tathagata, and Mingzhou Song. 2021. “Fast Optimal Circular Clustering and Applications on Round Genomes.” IEEE/ACM Transactions on Computational Biology and Bioinformatics. https://doi.org/10.1109/TCBB.2021.3077573.