Machine‐learning Methods for Predicting Gene Function, Protein Structure and Genomic Variation Effects in Precision Biology
Subject Areas: Biotechnological Journal of Environmental Microbiology
Abdul Razak Mohamed Sikkander 1, Joel J. P. C. Rodrigues 2, Hala S. Abuelmakarem 3, Manoharan Meena 4
1 -
2 -
3 -
4 -
Keywords: Machine learning, gene function prediction, protein structure, genomic variation, deep learning, bioinformatics, functional genomics
Abstract:
Machine learning (ML) has emerged as a transformative approach in computational genomics, offering tools to predict gene function, model protein structure, and identify genomic variation with high accuracy. Traditional bioinformatics methods, though effective, often struggle with the high dimensionality and non-linear relationships inherent in genomic datasets. ML algorithms, such as random forests, support vector machines, and convolutional and transformer neural networks, can learn complex representations from heterogeneous biological data, enabling functional annotation of uncharacterized genes, accurate modeling of protein folding, and detection of pathogenic variants. This paper explores the methodologies, results, and implications of integrating ML models in genomics and proteomics. A hypothetical dataset is presented to illustrate gene-function prediction, protein-structure inference, and variant classification using supervised and deep-learning frameworks. Results on this illustrative dataset suggest that ML approaches can outperform conventional statistical pipelines in prediction accuracy, generalization, and scalability. However, interpretability, data imbalance, and transferability across species remain major challenges. The discussion emphasizes the integration of ML with experimental validation, while future perspectives highlight the potential of foundation models and multimodal learning for functional genomics. Collectively, these advances bring us closer to a predictive, data-driven understanding of life’s molecular machinery.
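The data-imbalance challenge noted in the abstract can be illustrated with a minimal sketch. The labels below are hypothetical (5 pathogenic variants among 100) and are not drawn from the paper's dataset; the function names are likewise illustrative. The sketch shows why plain accuracy can overstate the performance of a variant classifier when pathogenic variants are rare, and why a class-balanced metric may be more informative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of variants whose predicted label matches the true label."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity, so each class counts equally."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# Hypothetical labels: 1 = pathogenic, 0 = benign (assumed 5% prevalence).
y_true = [1] * 5 + [0] * 95
# A trivial baseline that calls every variant benign.
y_majority = [0] * 100

print(accuracy(y_true, y_majority))           # 0.95
print(balanced_accuracy(y_true, y_majority))  # 0.5
```

Under these assumed labels, the trivial baseline reaches 95% accuracy while detecting no pathogenic variant, which is why comparisons between ML models and conventional pipelines on imbalanced variant sets are usually better reported with class-aware metrics.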
