Predicting Evolutionary Importance of Amino Acids through Mutation of Codons Using K-means Clustering
Abstract
Mutation is a random biological event that can cause permanent (long-term) change in a living organism through structural or compositional alterations in its proteins. During mutation, genetic material such as the nucleotide bases in a codon is changed, which alters the codon and, consequently, may alter the amino acid that the new codon encodes. In this study, mutations at different nucleotide base positions within codons are analyzed to understand the evolutionary importance of amino acids. We created hypothetical mutations at the first, second, and third positions of all 61 sense codons (the 64 codons minus the three stop codons) and categorized the resulting amino acids using K-means clustering. Our analysis reveals that mutations at the second base position generate the highest number of distinct amino acids, indicating greater evolutionary significance than first- and third-position mutations. We applied the proposed framework to a SARS-CoV-2 (Homo sapiens) amino acid sequence and deduced several significant findings about its mutation patterns. The clustering analysis revealed that Glycine (G), Alanine (A), Proline (P), Valine (V), and one polar amino acid recur in the combined centroids of the clusters. These amino acids, predominantly hydrophobic, play a crucial role in stabilizing protein structures. The framework not only provides insight into mutation patterns and their biological significance but also underscores the importance of specific amino acids in the evolutionary process.
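The position-wise mutation scan described in the abstract can be illustrated with a minimal sketch. It assumes the standard genetic code (NCBI translation table 1) and is not the authors' implementation. The abstract does not specify the features used for clustering, so `codon_features` (the number of distinct new amino acids reachable from each codon position) is a hypothetical encoding chosen only for illustration. Likewise, `kmeans` is a plain Lloyd's-algorithm loop with deterministic seeding, and `k=3` is an arbitrary choice.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, ..., GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE_CODONS = [c for c, aa in CODON_TABLE.items() if aa != "*"]  # the 61 non-stop codons


def point_mutants(codon, pos):
    """Return the three codons that differ from `codon` only at index `pos`."""
    return [codon[:pos] + b + codon[pos + 1:] for b in BASES if b != codon[pos]]


def tally_by_position():
    """Count synonymous, missense and nonsense outcomes for each codon position."""
    tallies = [Counter() for _ in range(3)]
    for codon, pos in product(SENSE_CODONS, range(3)):
        for mutant in point_mutants(codon, pos):
            new_aa = CODON_TABLE[mutant]
            if new_aa == "*":
                kind = "nonsense"
            elif new_aa == CODON_TABLE[codon]:
                kind = "synonymous"
            else:
                kind = "missense"
            tallies[pos][kind] += 1
    return tallies


def codon_features(codon):
    """Hypothetical feature vector: distinct new amino acids reachable per position."""
    original = CODON_TABLE[codon]
    return tuple(
        float(len({CODON_TABLE[m] for m in point_mutants(codon, pos)} - {original, "*"}))
        for pos in range(3)
    )


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm, seeded with the first k distinct points (an assumption)."""
    centroids = list(dict.fromkeys(points))[:k]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(tuple(sum(x) / len(members) for x in zip(*members)) if members else old)
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


if __name__ == "__main__":
    for pos, tally in enumerate(tally_by_position(), start=1):
        print(f"position {pos}: {dict(tally)}")
    features = [codon_features(c) for c in SENSE_CODONS]
    labels, centroids = kmeans(features, k=3)
    print(Counter(labels))
```

Under the standard code, no second-position substitution in a sense codon is synonymous: every such change either replaces the amino acid or creates a stop codon. This is consistent with the abstract's emphasis on the second position. Third-position substitutions, by contrast, are frequently silent.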
Copyright (c) 2024 Nasrin Irshad Hussain, Kuntala Boruah, Adil Akhtar
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.