Semantic-Filtered SMOTE-PSO for Breast Cancer Trial Eligibility Classification

Taslim; Mumtazimah Mohamad

doi:10.35882/jeeemi.v8i3.1706

Taslim, Faculty of Computer Science, Universitas Lancang Kuning, Pekanbaru, Indonesia; Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin, Terengganu, Malaysia. ORCID: https://orcid.org/0009-0008-0025-6301
Mumtazimah Mohamad, Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin, Terengganu, Malaysia; Artificial Intelligence Research Centre for Islam Sustainability, Universiti Sultan Zainal Abidin, Terengganu, Malaysia. ORCID: https://orcid.org/0000-0001-8151-6022

Keywords: BioBERT; clinical trial eligibility classification; class imbalance; SMOTE-PSO optimization; semantic filtering

Abstract

This study addresses breast cancer clinical trial eligibility classification from free-text criteria under severe class imbalance, a condition that biases learning toward the majority class and complicates screening decisions when false positives and false negatives carry different operational costs. The study evaluates whether semantic plausibility control and optimization improve classification performance and screening-oriented error trade-offs under imbalanced conditions. The main contribution of this study is the proposed BEACoN framework, which integrates semantic-filtered augmentation and PSO-guided optimization within a unified screening-oriented eligibility classification setting. Four BioBERT-BiLSTM variants were evaluated using fixed train-validation-test partitions across three random seeds: a baseline model (M1), SMOTE augmentation (M2), SMOTE with cosine filtering (M2.5), and the proposed BEACoN framework (M3). Performance was evaluated using Precision, Recall, F1, AUROC, and AUPRC with pooled multi-seed statistical analysis to improve robustness and reduce single-seed bias. The evaluated augmentation-based configurations achieved pooled F1 scores up to 0.9381 ± 0.0005, AUROC up to 0.9976 ± 0.0001, and AUPRC up to 0.9808 ± 0.0004, indicating improved screening-oriented classification performance relative to the baseline. However, SMOTE with cosine filtering behaved broadly similarly to standard SMOTE under the evaluated embedding setting, indicating that the selected cosine threshold functioned largely as a permissive constraint, although modest seed-dependent prediction differences were still observed. Although BEACoN did not demonstrate statistically significant superiority over SMOTE in aggregate performance, it provided a more balanced false-positive and false-negative trade-off under comparable classification performance. 
Overall, the findings suggest that plausibility-controlled augmentation may provide practical value for screening-oriented eligibility classification under severe class imbalance.
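
To make the idea of cosine-filtered SMOTE concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It interpolates between minority-class vectors in the usual SMOTE manner and keeps a synthetic vector only if its cosine similarity to the minority centroid meets a threshold. The function names (`cosine`, `filtered_smote`), the centroid-based criterion, and the values of `k` and `threshold` are all assumptions made for illustration; the abstract does not specify the reference point or the threshold used in the study.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filtered_smote(minority, n_new, k=2, threshold=0.9, seed=0, max_tries=1000):
    """Generate up to n_new synthetic minority vectors (needs >= 2 inputs).

    ASSUMPTION: a candidate is accepted when its cosine similarity to the
    minority centroid is at least `threshold`. The criterion and the default
    values are hypothetical, chosen only for this sketch.
    """
    rng = random.Random(seed)
    dim = len(minority[0])
    centroid = [sum(x[d] for x in minority) / len(minority) for d in range(dim)]
    kept, rejected = [], 0
    for _ in range(max_tries):
        if len(kept) == n_new:
            break
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (Euclidean distance)
        neighbours = sorted(
            (x for x in minority if x is not base),
            key=lambda x: math.dist(base, x),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        candidate = [b + gap * (o - b) for b, o in zip(base, other)]
        if cosine(candidate, centroid) >= threshold:
            kept.append(candidate)
        else:
            rejected += 1
    return kept, rejected


if __name__ == "__main__":
    # Toy 2-D "embeddings" that all point in roughly the same direction.
    minority = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.8, 0.1]]
    kept, rejected = filtered_smote(minority, n_new=5, threshold=0.9)
    print(len(kept), "kept,", rejected, "rejected")
```

When the minority vectors are nearly parallel, as in the toy data, a loose threshold rejects nothing and the filter reduces to plain SMOTE. This is one way a cosine threshold can act as a permissive constraint, in line with the behaviour the abstract reports for M2.5 relative to M2.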
