De-identification of Protected Health Information in Clinical Document Images using Deep Learning and Pattern Matching

Ravichandra Sriram; Siva Sathya S; Lourdumarie Sophie S

doi:10.35882/jeeemi.v7i1.616

Ravichandra Sriram Department of Computer Science, Pondicherry University, Pondicherry, India. https://orcid.org/0000-0001-9826-3957
Siva Sathya S Department of Computer Science, Pondicherry University, Pondicherry, India. https://orcid.org/0000-0003-1009-6504
Lourdumarie Sophie S Department of Computer Science, Pondicherry University, Pondicherry, India. https://orcid.org/0000-0001-5345-598X

Keywords: Clinical De-identification, Bio-Medical Data Sharing, Document Image Analysis, Object Detection, Pattern Matching

Abstract

Clinical documents such as lab results, discharge summaries, and radiology reports are primarily used by doctors for diagnosis and treatment. With the growing adoption of AI in healthcare, researchers also use them widely for disease diagnosis, prediction, and the design of schemes for quality healthcare delivery. Although hospitals produce large volumes of clinical documents every day, these documents are rarely shared with researchers because health records are sensitive. Before they can be shared, they must be de-identified; that is, the protected health information (PHI) must be removed to preserve patient privacy. When documents are stored as digital text, PHI can be identified and removed relatively easily, but finding and extracting PHI from old clinical documents that were scanned and stored as images or in other formats is a daunting task, and machine learning models for it must be trained on a large number of such images. This work introduces a novel combination of deep learning and pattern matching algorithms for the efficient de-identification of scanned clinical documents, which distinguishes it from previous methods that work primarily on text documents rather than on scanned documents or images. It proposes a comprehensive de-identification technique that automatically extracts PHI from scanned images of clinical documents. For the experiments, we created a synthetic dataset of 700 clinical document images obtained from various patients across multiple hospitals. The de-identification framework comprises two phases: (1) training YOLOv3-Document Layout Analysis (YOLOv3-DLA), a deep learning model that segments the various regions in a clinical document; and (2) identifying the regions that contain PHI through pattern-matching techniques and deleting or anonymizing the information in those regions.
The proposed method identifies regions based on content structure, which facilitates the de-identification of PHI regions, and achieves an F1 score of 0.97. The system can be readily adapted to accommodate any form of clinical document.
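
The second phase described above (pattern matching over segmented regions) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the layout model has already produced bounding boxes and that the text of each region has been read out by some OCR step, which the abstract does not specify. The `Region` type, the pattern names, and the regular expressions themselves are hypothetical choices made only for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical PHI patterns for illustration; the source does not list its patterns.
PHI_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{10}\b"),
    "PATIENT_ID": re.compile(r"\b(?:MRN|UHID|Patient ID)\s*[:#]?\s*\w+", re.IGNORECASE),
    "NAME": re.compile(r"\b(?:Patient Name|Name)\s*:\s*[A-Za-z .]+", re.IGNORECASE),
}


@dataclass
class Region:
    """One segmented region (assumed detector output plus OCR text)."""
    label: str   # layout class predicted for the region
    box: tuple   # (x, y, width, height) in pixels
    text: str    # text recognised inside the box


def find_phi(text):
    """Return the sorted names of all PHI patterns that occur in the text."""
    return sorted(name for name, pattern in PHI_PATTERNS.items() if pattern.search(text))


def redact(text):
    """Replace every PHI match with a placeholder naming its category."""
    for name, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{name}]", text)
    return text


def deidentify(regions):
    """Return the regions with PHI text redacted and the boxes to mask in the image."""
    cleaned, boxes_to_mask = [], []
    for region in regions:
        if find_phi(region.text):
            boxes_to_mask.append(region.box)
            cleaned.append(Region(region.label, region.box, redact(region.text)))
        else:
            cleaned.append(region)
    return cleaned, boxes_to_mask


if __name__ == "__main__":
    page = [
        Region("header", (0, 0, 100, 20), "Patient Name: John Doe"),
        Region("body", (0, 30, 100, 200), "Hemoglobin 13.5 g/dL"),
    ]
    cleaned, boxes = deidentify(page)
    print([r.text for r in cleaned], boxes)
```

In this sketch the returned boxes mark where the image itself would be blanked or overwritten, while the redacted text shows the anonymizing alternative. A real system would need patterns tuned to the hospitals' document formats.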
