Vision Language Transformer Framework for Efficient Cancer Diagnosis through Multimodal Integration
Abstract
Finding and treating cancer as early as possible gives patients the best chance of a good outcome. However, the imaging and biopsy procedures needed for diagnosis are often costly and unevenly available, which limits access for many patients in clinical settings. Recent AI methods, particularly deep learning, can address these limitations and make cancer detection more efficient and scalable. In this context, large language models (LLMs) and vision-language models (VLMs) are leading approaches for interpreting multimodal variables in AI-driven healthcare systems. Although LLMs handle unstructured clinical text well, they have rarely been applied to patient assessment beyond descriptive or summarization tasks that combine images and textual descriptions with both structured and unstructured data. VLMs, in contrast, let clinicians and medical researchers examine evidence of cancer from multiple complementary perspectives. In this work, we study both LLMs and VLMs for cancer detection, analyzing their architectures, learning mechanisms, and performance on various datasets, and identifying directions for expanding multimodal AI in healthcare. Our results indicate that combining imaging and text data improves diagnostic accuracy across different cancer types. Experiments on the MIMIC-III, MIMIC-IV, TCGA, and CAMELYON16/17 datasets show that multimodal transformer models significantly improve the accuracy of biopsy-based diagnosis. In particular, BioViL achieves an AUC-ROC of 0.92 for lung cancer detection, while fine-tuned CLIP reaches 0.91 for colon cancer detection.
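As a rough illustration of the kind of image-text fusion described above, the sketch below trains a small classification head on top of frozen CLIP embeddings of an image and its paired report, then evaluates with AUC-ROC. This is a minimal sketch, not the authors' pipeline: the Hugging Face transformers CLIP API, the openai/clip-vit-base-patch32 checkpoint, the FusionClassifier head, and the toy image/report pairs are illustrative assumptions; a domain model such as BioViL or MedCLIP and real histopathology tiles with clinical notes would be substituted in practice.

```python
# Minimal late-fusion sketch (illustrative only, not the paper's implementation).
# Assumes: transformers, torch, scikit-learn, pillow installed; toy stand-in data.
import torch
import torch.nn as nn
from PIL import Image
from sklearn.metrics import roc_auc_score
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen general-purpose CLIP backbone; a medical VLM (e.g., BioViL, MedCLIP)
# would be swapped in the same way.
backbone = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class FusionClassifier(nn.Module):
    """Concatenate image and text embeddings and predict tumor probability (hypothetical head)."""
    def __init__(self, dim: int = 512):  # 512 = ViT-B/32 projection size
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * dim, 256), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(256, 1),
        )

    def forward(self, img_emb, txt_emb):
        return self.head(torch.cat([img_emb, txt_emb], dim=-1)).squeeze(-1)

@torch.no_grad()
def embed(images, reports):
    """Encode image/report pairs with the frozen backbone."""
    inputs = processor(text=reports, images=images, return_tensors="pt", padding=True).to(device)
    img = backbone.get_image_features(pixel_values=inputs["pixel_values"])
    txt = backbone.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    return img, txt

# Toy stand-in data; real experiments would use curated slide tiles and report text.
images = [Image.new("RGB", (224, 224), color=c) for c in ("white", "gray")]
reports = ["Biopsy shows atypical glandular cells.", "No evidence of malignancy."]
labels = torch.tensor([1.0, 0.0], device=device)

clf = FusionClassifier().to(device)
optimizer = torch.optim.AdamW(clf.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

img_emb, txt_emb = embed(images, reports)
for _ in range(20):  # tiny illustrative training loop
    optimizer.zero_grad()
    loss = loss_fn(clf(img_emb, txt_emb), labels)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    probs = torch.sigmoid(clf(img_emb, txt_emb)).cpu().numpy()
print("AUC-ROC on toy data:", roc_auc_score(labels.cpu().numpy(), probs))
```

Late fusion of frozen embeddings is only one option; co-attention or joint fine-tuning of the backbone, as in several of the cited works, trades additional compute for tighter cross-modal interaction.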
References
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), vol. 1, pp. 4171–4186, 2019, doi: https://doi.org/10.18653/v1/n19-1423. Available: https://aclanthology.org/N19-1423/
J. Lee et al., “BioBERT: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, Sep. 2019, doi: https://doi.org/10.1093/bioinformatics/btz682.
E. Alsentzer et al., “Publicly Available Clinical BERT Embeddings,” arXiv:1904.03323 [cs], Jun. 2019. Available: https://arxiv.org/abs/1904.03323
Y. Peng, S. Yan, and Z. Lu, “Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets,” arXiv:1906.05474 [cs], Jun. 2019. Available: https://arxiv.org/abs/1906.05474
A. Rajkomar et al., “Scalable and accurate deep learning with electronic health records,” npj Digital Medicine, vol. 1, no. 1, May 2018, doi: https://doi.org/10.1038/s41746-018-0029-1.
A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” 2021. Available: https://proceedings.mlr.press/v139/radford21a/radford21a.pdf
S.-C. Huang, L. Shen, M. Lungren, and S. Yeung, “GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. Available: https://openaccess.thecvf.com/content/ICCV2021/papers/Huang_GLoRIA_A_Multimodal_Global-Local_Representation_Learning_Framework_for_Label-Efficient_Medical_ICCV_2021_paper.pdf
Z. Wang, Z. Wu, D. Agarwal, and J. Sun, “MedCLIP: Contrastive Learning from Unpaired Medical Images and Text,” 2022. Available: https://aclanthology.org/2022.emnlp-main.256.pdf
B. Boecking et al., “Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing,” Lecture Notes in Computer Science, pp. 1–21, 2022, doi: https://doi.org/10.1007/978-3-031-20059-5_1. Available: https://arxiv.org/abs/2204.09817
R. Chen et al., “Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images.” [Online]. Available: https://openaccess.thecvf.com/content/ICCV2021/papers/Chen_Multimodal_Co-Attention_Transformer_for_Survival_Prediction_in_Gigapixel_Whole_Slide_ICCV_2021_paper.pdf
F. Liu, Y. Liu, X. Ren, X. He, and X. Sun, “Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations.” Accessed: Aug. 26, 2025. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2019/file/9fe77ac7060e716f2d42631d156825c0-Paper.pdf
N. Coudray et al., “Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning,” Nature Medicine, vol. 24, no. 10, pp. 1559–1567, Sep. 2018, doi: https://doi.org/10.1038/s41591-018-0177-5.
G. Campanella et al., “Clinical-grade computational pathology using weakly supervised deep learning on whole slide images,” Nature Medicine, vol. 25, no. 8, pp. 1301–1309, Aug. 2019, doi: https://doi.org/10.1038/s41591-019-0508-1.
D. Wang, A. Khosla, R. Gargeya, H. Irshad, and A. H. Beck, “Deep Learning for Identifying Metastatic Breast Cancer,” arXiv.org, Jun. 18, 2016. https://arxiv.org/abs/1606.05718v1
A. Esteva et al., “Dermatologist-level Classification of Skin Cancer with Deep Neural Networks,” Nature, vol. 542, no. 7639, pp. 115–118, Jan. 2017, doi: https://doi.org/10.1038/nature21056.
A. E. W. Johnson et al., “MIMIC-III, a freely accessible critical care database,” Scientific Data, vol. 3, May 2016, doi: https://doi.org/10.1038/sdata.2016.35.
A. E. W. Johnson et al., “MIMIC-IV, a freely accessible electronic health record dataset,” Scientific Data, vol. 10, no. 1, Jan. 2023, doi: https://doi.org/10.1038/s41597-022-01899-x.
J. Irvin et al., “CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 590–597, Jul. 2019, doi: https://doi.org/10.1609/aaai.v33i01.3301590.
A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” arXiv:2010.11929 [cs], Oct. 2020, Available: https://arxiv.org/abs/2010.11929
R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626, Oct. 2017, doi: https://doi.org/10.1109/iccv.2017.74.
A. Vaswani et al., “Attention Is All You Need,” arXiv.org, 2017. https://arxiv.org/abs/1706.03762
H. Tan and M. Bansal, “LXMERT: Learning Cross-Modality Encoder Representations from Transformers,” arXiv:1908.07490 [cs], Dec. 2019, Available: https://arxiv.org/abs/1908.07490
L. A. Hendricks et al., “Grounding Visual Explanations,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018. Accessed: Aug. 26, 2025. [Online]. Available: https://openaccess.thecvf.com/content_ECCV_2018/papers/Lisa_Anne_Hendricks_Grounding_Visual_Explanations_ECCV_2018_paper.pdf
Z. Cai, R. C. Poulos, A. Aref, P. J. Robinson, R. R. Reddel, and Q. Zhong, “DeePathNet: A Transformer-Based Deep Learning Model Integrating Multiomic Data with Cancer Pathways,” Cancer Research Communications, vol. 4, no. 12, pp. 3151–3164, Dec. 2024, doi: https://doi.org/10.1158/2767-9764.crc-24-0285.
G. Li et al., “Transformer-based AI technology improves early ovarian cancer diagnosis using cfDNA methylation markers,” Cell Reports Medicine, vol. 5, no. 8, p. 101666, Aug. 2024, doi: https://doi.org/10.1016/j.xcrm.2024.101666.
G. Ayana et al., “Vision-Transformer-Based Transfer Learning for Mammogram Classification,” Diagnostics, vol. 13, no. 2, p. 178, Jan. 2023, doi: https://doi.org/10.3390/diagnostics13020178.
T. Shahzad, T. Mazhar, S. M. Saqib, and K. Ouahada, “Transformer-inspired training principles based breast cancer prediction: combining EfficientNetB0 and ResNet50,” Scientific Reports, vol. 15, no. 1, Apr. 2025, doi: https://doi.org/10.1038/s41598-025-98523-w.
H. Yang, M. Yang, J. Chen, G. Yao, Q. Zou, and L. Jia, “Multimodal deep learning approaches for precision oncology: a comprehensive review,” Briefings in Bioinformatics, vol. 26, no. 1, Nov. 2024, doi: https://doi.org/10.1093/bib/bbae699.
A. Waqas, A. Tripathi, R. P. Ramachandran, P. A. Stewart, and G. Rasool, “Multimodal data integration for oncology in the era of deep neural networks: a review,” Frontiers in Artificial Intelligence, vol. 7, Jul. 2024, doi: https://doi.org/10.3389/frai.2024.1408843.
F.-Z. Nakach, A. Idri, and E. Goceri, “A comprehensive investigation of multimodal deep learning fusion strategies for breast cancer classification,” Artificial Intelligence Review, vol. 57, no. 12, Oct. 2024, doi: https://doi.org/10.1007/s10462-024-10984-z.
Y. Gao et al., “An explainable longitudinal multi-modal fusion model for predicting neoadjuvant therapy response in women with breast cancer,” Nature Communications, vol. 15, no. 1, Nov. 2024, doi: https://doi.org/10.1038/s41467-024-53450-8.
L. Liu et al., “AutoCancer as an automated multimodal framework for early cancer detection,” iScience, vol. 27, no. 7, p. 110183, Jun. 2024, doi: https://doi.org/10.1016/j.isci.2024.110183.
A. Patel et al., “Cross Attention Transformers for Multi-modal Unsupervised Whole-Body PET Anomaly Detection,” Lecture Notes in Computer Science, pp. 14–23, Jan. 2022, doi: https://doi.org/10.1007/978-3-031-18576-2_2.
R. Gupta and H. Lin, “Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions,” arXiv:2504.11373, 2025. Available: https://arxiv.org/html/2504.11373v1
J. Clusmann et al., “Prompt injection attacks on vision language models in oncology,” Nature Communications, vol. 16, no. 1, Feb. 2025, doi: https://doi.org/10.1038/s41467-024-55631-x.
Y. Yang et al., “Demographic bias of expert-level vision-language foundation models in medical imaging,” Science Advances, vol. 11, no. 13, Mar. 2025, doi: https://doi.org/10.1126/sciadv.adq0305.
Y. Luo, Hamed Hooshangnejad, W. Ngwa, and K. Ding, “Opportunities and challenges in lung cancer care in the era of large language models and vision language models,” Translational Lung Cancer Research, vol. 14, no. 5, pp. 1830–1847, May 2025, doi: https://doi.org/10.21037/tlcr-24-801.
Y. Wang et al., “Enhancing vision-language models for medical imaging: bridging the 3D gap with innovative slice selection,” in Advances in Neural Information Processing Systems (NeurIPS 2024), Datasets and Benchmarks Track, 2024. Available: https://proceedings.neurips.cc/paper_files/paper/2024/hash/b53513b83232116ae25f57a174a7c993-Abstract-Datasets_and_Benchmarks_Track.html
V. de, R. Ravazio, C. Mattjie, L. S. Kupssinskü, C. Maria, and R. C. Barros, “Unlocking the Potential of Vision-Language Models for Mammography Analysis,” in 2024 IEEE International Symposium on Biomedical Imaging (ISBI), pp. 1–4, May 2024, doi: https://doi.org/10.1109/isbi56570.2024.10635683.
A. van den Oord, Y. Li, and O. Vinyals, “Representation Learning with Contrastive Predictive Coding,” arXiv:1807.03748 [cs, stat], Jan. 2019, Available: https://arxiv.org/abs/1807.03748
D. Kiela, A. Conneau, A. Jabri, and M. Nickel, “Learning Visually Grounded Sentence Representations,” arXiv.org, 2017. https://arxiv.org/abs/1707.06320
Copyright (c) 2025 Bala Gangadhara Gutam, Sunil Kumar Malchi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.