De-identification of Protected Health Information in Clinical Document Images using Deep Learning and Pattern Matching
Abstract
Clinical documents such as lab results, discharge summaries, and radiology reports are used by doctors for diagnosis and treatment. With the growing adoption of AI in healthcare, researchers also use them widely for disease diagnosis, prediction, and the design of schemes for quality healthcare delivery. Although hospitals produce huge volumes of clinical documents every day, these are not shared with researchers because health records are sensitive. Before such documents can be shared, they must be de-identified, that is, the protected health information (PHI) must be removed to preserve patient privacy. When documents are stored as digital text, PHI can be identified and removed easily, but finding and extracting PHI from old clinical documents that were scanned and stored as images or in other formats is a daunting task that requires training machine learning models on a large number of such images. This work introduces a novel combination of deep learning and pattern-matching algorithms for the efficient de-identification of scanned clinical documents, in contrast to previous methods, which work primarily on text documents and not on scanned documents or images. The result is a comprehensive technique for automatically extracting PHI from scanned images of clinical documents. For experimental purposes, we created a synthetic dataset of 700 clinical document images obtained from various patients across multiple hospitals. The de-identification framework comprises two phases: (1) training YOLOv3-Document Layout Analysis (YOLOv3-DLA), a deep learning model that segments the various regions of a clinical document; and (2) identifying the regions that contain PHI through pattern-matching techniques and deleting or anonymizing the information in those regions.
The proposed method was implemented to identify regions based on content structure, facilitating the de-identification of PHI regions and achieving an F1 score of 0.97. This system can be readily adapted to accommodate any form of clinical document.
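The second phase of the framework, pattern matching over the text of each segmented region, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not list the actual patterns, the OCR step, or the region format, so the `Region` type, the `PHI_PATTERNS` table, and the `redact` and `deidentify` helpers are hypothetical names, and the regular expressions cover only a few simple PHI shapes (dates, phone numbers, e-mail addresses, labelled patient identifiers).

```python
import re
from dataclasses import dataclass

# Hypothetical PHI patterns for illustration only; the paper's actual
# rules are not given in the abstract.
PHI_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ID": re.compile(r"\b(?:MRN|Patient ID)\s*[:#]?\s*\w+", re.IGNORECASE),
}


@dataclass
class Region:
    """One segmented region (assumed output of layout analysis plus OCR)."""
    label: str    # layout class assigned by the segmentation model
    bbox: tuple   # (x, y, width, height) in pixels
    text: str     # text recognised inside the region


def find_phi(text):
    """Return sorted (start, end, tag) spans for every pattern match."""
    spans = []
    for tag, pattern in PHI_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), tag))
    return sorted(spans)


def redact(text):
    """Replace each matched PHI span with a bracketed tag such as [DATE]."""
    pieces = []
    cursor = 0
    for start, end, tag in find_phi(text):
        if start < cursor:
            # Skip a span that overlaps one already redacted.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{tag}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def deidentify(regions):
    """Return copies of the regions with PHI in their text anonymized."""
    return [Region(r.label, r.bbox, redact(r.text)) for r in regions]
```

Calling `redact` on a string with no matching pattern returns it unchanged, so regions without PHI pass through untouched. A real system would also need to blank the corresponding pixels in the image, which this text-only sketch does not attempt.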
Copyright (c) 2024 Ravichandra Sriram, Siva Sathya S, Lourdumarie Sophie S

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.