Hybrid Sign Language Recognition Framework Leveraging MobileNetV3, Multi-Head Self-Attention, and LightGBM
Abstract
Sign language recognition (SLR) plays a pivotal role in enhancing communication accessibility and fostering the inclusion of deaf communities. Despite significant advancements in SLR systems, challenges such as variability in sign language gestures, the need for real-time processing, and the complexity of capturing spatiotemporal dependencies remain unresolved. This study addresses these limitations by proposing an advanced framework that integrates deep learning and machine learning techniques to optimize sign language recognition, with a focus on the Indian Sign Language (ISL) dataset. The framework leverages MobileNetV3 for feature extraction, selected after rigorous evaluation against VGG16, ResNet50, and EfficientNet-B0, in which MobileNetV3 demonstrated superior accuracy and efficiency. To enhance the model's ability to capture complex dependencies and contextual information, multi-head self-attention (MHSA) is incorporated; the attention mechanism enriches the extracted features and enables a better understanding of sign language gestures. Finally, LightGBM, a gradient-boosting algorithm that scales efficiently to large datasets, is employed for classification. The proposed framework achieved strong results, with a test accuracy of 98.42%, precision of 98.19%, recall of 98.81%, and an F1-score of 98.15%. The integration of MobileNetV3, MHSA, and LightGBM offers a robust and adaptable solution that outperforms existing methods, demonstrating its potential for real-world deployment. By addressing the challenges of variability, real-time processing, and spatiotemporal dependencies, this study advances precise and accessible communication technologies for deaf individuals and contributes to more inclusive and effective human-computer interaction systems. Future work will expand the dataset to include more diverse gestures and environmental conditions and explore cross-lingual adaptation to enhance the model's applicability and impact.
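The pipeline described in the abstract (MobileNetV3 features, MHSA enrichment, LightGBM classification) can be illustrated with a minimal sketch. The example below assumes a TensorFlow/Keras + LightGBM stack; the input resolution, number of attention heads, LightGBM settings, and placeholder data are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the MobileNetV3 -> MHSA -> LightGBM pipeline.
# Hyperparameters and data below are placeholders for illustration only.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from lightgbm import LGBMClassifier

IMG_SIZE = 224      # assumed input resolution
NUM_HEADS = 4       # assumed number of attention heads
KEY_DIM = 64        # assumed per-head key dimensionality
NUM_CLASSES = 35    # placeholder class count for the ISL gesture set

def build_feature_extractor():
    """MobileNetV3 backbone followed by multi-head self-attention over spatial tokens."""
    backbone = tf.keras.applications.MobileNetV3Large(
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
        include_top=False,        # keep only the convolutional feature maps
        weights="imagenet",       # transfer learning from ImageNet
    )
    inputs = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
    feat = backbone(inputs)                              # (H', W', C) feature map
    h, w, c = feat.shape[1], feat.shape[2], feat.shape[3]
    tokens = layers.Reshape((h * w, c))(feat)            # flatten spatial grid into tokens
    attended = layers.MultiHeadAttention(
        num_heads=NUM_HEADS, key_dim=KEY_DIM
    )(tokens, tokens)                                    # self-attention across tokens
    attended = layers.LayerNormalization()(tokens + attended)  # residual + norm
    pooled = layers.GlobalAveragePooling1D()(attended)   # fixed-length feature vector
    return tf.keras.Model(inputs, pooled)

if __name__ == "__main__":
    # Random placeholder data standing in for preprocessed ISL gesture images.
    x_train = np.random.rand(64, IMG_SIZE, IMG_SIZE, 3).astype("float32")
    y_train = np.random.randint(0, NUM_CLASSES, size=64)

    extractor = build_feature_extractor()
    train_feats = extractor.predict(x_train, verbose=0)  # attention-enriched features

    # LightGBM classifies the enriched feature vectors.
    clf = LGBMClassifier(n_estimators=200)               # assumed settings
    clf.fit(train_feats, y_train)
    print(clf.predict(train_feats[:5]))
```

In this sketch the backbone acts as a frozen feature extractor and LightGBM is trained separately on the pooled vectors; whether the attention block is fine-tuned jointly with the backbone is not specified here and would depend on the training setup.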
Copyright (c) 2025 Hemant Kumar, Rishabh Sachan, Mamta Tiwari, Amit Kumar Katiyar, Namita Awasthi, Puspha Mamoria

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.