SympTextML: Leveraging Natural Language Symptom Descriptions for Accurate Multi-Disease Prediction

Dhairya Vyas; Milind Shah; Harsh Kantawala; Brijesh Patel; Tejas Patel; Jalaja Enamala

doi:10.35882/jeeemi.v7i3.946

Dhairya Vyas Computer Science and Engineering Department, The Maharaja Sayajirao University of Baroda, Gujarat, India https://orcid.org/0000-0001-7636-4167
Milind Shah Department of Computer Engineering, Sardar Vallabhbhai Patel Institute of Technology (S.V.I.T), Vasad, Gujarat, India https://orcid.org/0009-0001-6077-3924
Harsh Kantawala Department of Computer Engineering, G H Patel College of Engineering & Technology – The Charutar Vidhya Mandal University (CVM), Vallabh - Vidhyanagar, Gujarat, India https://orcid.org/0000-0002-7363-0770
Brijesh Patel Department of Computer Engineering, G H Patel College of Engineering & Technology – The Charutar Vidhya Mandal University (CVM), Vallabh - Vidhyanagar, Gujarat, India https://orcid.org/0000-0002-0132-6470
Tejas Patel Department of Computer Engineering, G H Patel College of Engineering & Technology – The Charutar Vidhya Mandal University (CVM), Vallabh - Vidhyanagar, Gujarat, India https://orcid.org/0009-0000-5154-978X
Jalaja Enamala Dhruva College of Management, Hyderabad, Telangana, India https://orcid.org/0000-0003-4818-1405

Keywords: AI-driven framework, symptom classification, large language models, natural language processing, ensemble learning

Abstract

This research presents an AI-driven framework for multi-disease classification from natural language symptom descriptions, optimized through preprocessing techniques oriented toward large language models (LLMs). The proposed system integrates three essential NLP steps (text normalization, lemmatization, and n-gram vectorization) to convert unstructured clinical symptom data into machine-readable form. A publicly available dataset comprising 8,498 samples across ten common diseases, including pneumonia, heart attack, diabetes, stroke, asthma, and depression, was used for training and evaluation. Data cleaning and balancing ensured uniform class representation, with 1,200 samples per disease category. The processed dataset was used to train supervised machine learning models, including SVM, KNN, Decision Tree, Random Forest, and Extra Trees, to identify the most effective classifier. Experiments conducted in Google Colab showed that the ensemble models (Random Forest and Extra Trees) outperformed the others, achieving 99% accuracy, precision, recall, and F1-score, while SVM and Decision Tree followed closely at 98% across all four metrics. Notably, the models consistently predicted pneumonia with high confidence for relevant input queries, supporting the framework's robustness. This work demonstrates the efficacy of combining LLM-compatible preprocessing with traditional ML classifiers for accurate disease detection from symptom narratives. The proposed approach is a foundational step toward scalable, intelligent healthcare support systems capable of real-time disease prediction and decision-making assistance.
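The preprocessing chain named in the abstract (text normalization, lemmatization, n-gram vectorization) followed by a classifier can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation. The training sentences are invented for illustration, `crude_lemma` is a hypothetical suffix-stripping stand-in for a real lemmatizer, and a cosine-similarity nearest-neighbour vote stands in for the KNN model. All function and variable names are assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical toy data, invented for illustration (not the paper's dataset).
TRAIN = [
    ("Persistent cough with fever and chest pain when breathing", "pneumonia"),
    ("Coughing up phlegm, high fever, chills", "pneumonia"),
    ("Wheezing and shortness of breath at night", "asthma"),
    ("Tight chest and wheezing after exercise", "asthma"),
    ("Excessive thirst, frequent urination, blurred vision", "diabetes"),
    ("Always thirsty and urinating frequently, tired", "diabetes"),
]


def normalize(text):
    """Text normalization: lowercase, drop non-letters, split on whitespace."""
    return re.sub(r"[^a-z\s]", " ", text.lower()).split()


def crude_lemma(token):
    """Naive suffix stripping; an assumed stand-in for a real lemmatizer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n_max=2):
    """Return all word n-grams for n = 1..n_max as space-joined strings."""
    feats = []
    for n in range(1, n_max + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats


def vectorize(text):
    """Map a symptom description to a sparse n-gram count vector."""
    tokens = [crude_lemma(t) for t in normalize(text)]
    return Counter(ngrams(tokens))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


TRAIN_VECTORS = [(vectorize(text), label) for text, label in TRAIN]


def predict(text, k=1):
    """Majority vote over the k most similar training descriptions."""
    query = vectorize(text)
    ranked = sorted(TRAIN_VECTORS, key=lambda item: cosine(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(predict("I have a bad cough, fever and pain in my chest"))
```

In practice each stage would likely be replaced by a library component (for example a dictionary-based lemmatizer and a tuned ensemble classifier), and the sketch makes no attempt to reproduce the reported accuracy figures.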
