Hybrid Sign Language Recognition Framework Leveraging MobileNetV3, Multi-Head Self-Attention, and LightGBM
Abstract
Sign language recognition (SLR) plays a pivotal role in enhancing communication accessibility and fostering the inclusion of deaf communities. Despite significant advancements in SLR systems, challenges such as variability in sign language gestures, the need for real-time processing, and the complexity of capturing spatiotemporal dependencies remain unresolved. This study addresses these limitations by proposing an advanced framework that integrates deep learning and machine learning techniques to optimize sign language recognition, with a focus on the Indian Sign Language (ISL) dataset. The framework leverages MobileNetV3 for feature extraction, selected after rigorous evaluation against VGG16, ResNet50, and EfficientNet-B0, in which MobileNetV3 demonstrated superior accuracy and efficiency. To enhance the model's ability to capture complex dependencies and contextual information, multi-head self-attention (MHSA) is incorporated; the attention mechanism enriches the extracted features and enables a better understanding of sign language gestures. Finally, LightGBM, a gradient-boosting algorithm that scales efficiently to large datasets, is employed for classification. The proposed framework achieved strong results, with a test accuracy of 98.42%, precision of 98.19%, recall of 98.81%, and an F1-score of 98.15%. The integration of MobileNetV3, MHSA, and LightGBM offers a robust and adaptable solution that outperforms existing methods, demonstrating its potential for real-world deployment. By addressing the challenges of variability, real-time processing, and spatiotemporal dependencies, this study advances precise and accessible communication technologies for deaf individuals and contributes to more inclusive and effective human-computer interaction systems. Future work will expand the dataset to include more diverse gestures and environmental conditions and explore cross-lingual adaptation to enhance the model's applicability and impact.
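The pipeline described in the abstract (MobileNetV3 features, MHSA enrichment, LightGBM classification) can be illustrated with a minimal sketch. The example below assumes a TensorFlow/Keras + LightGBM stack; the input resolution, number of attention heads, LightGBM settings, and placeholder data are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the MobileNetV3 -> MHSA -> LightGBM pipeline.
# Hyperparameters and data below are placeholders for illustration only.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from lightgbm import LGBMClassifier

IMG_SIZE = 224      # assumed input resolution
NUM_HEADS = 4       # assumed number of attention heads
KEY_DIM = 64        # assumed per-head key dimensionality
NUM_CLASSES = 35    # placeholder class count for the ISL gesture set

def build_feature_extractor():
    """MobileNetV3 backbone followed by multi-head self-attention over spatial tokens."""
    backbone = tf.keras.applications.MobileNetV3Large(
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
        include_top=False,        # keep only the convolutional feature maps
        weights="imagenet",       # transfer learning from ImageNet
    )
    inputs = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
    feat = backbone(inputs)                              # (H', W', C) feature map
    h, w, c = feat.shape[1], feat.shape[2], feat.shape[3]
    tokens = layers.Reshape((h * w, c))(feat)            # flatten spatial grid into tokens
    attended = layers.MultiHeadAttention(
        num_heads=NUM_HEADS, key_dim=KEY_DIM
    )(tokens, tokens)                                    # self-attention across tokens
    attended = layers.LayerNormalization()(tokens + attended)  # residual + norm
    pooled = layers.GlobalAveragePooling1D()(attended)   # fixed-length feature vector
    return tf.keras.Model(inputs, pooled)

if __name__ == "__main__":
    # Random placeholder data standing in for preprocessed ISL gesture images.
    x_train = np.random.rand(64, IMG_SIZE, IMG_SIZE, 3).astype("float32")
    y_train = np.random.randint(0, NUM_CLASSES, size=64)

    extractor = build_feature_extractor()
    train_feats = extractor.predict(x_train, verbose=0)  # attention-enriched features

    # LightGBM classifies the enriched feature vectors.
    clf = LGBMClassifier(n_estimators=200)               # assumed settings
    clf.fit(train_feats, y_train)
    print(clf.predict(train_feats[:5]))
```

In this sketch the backbone acts as a frozen feature extractor and LightGBM is trained separately on the pooled vectors; whether the attention block is fine-tuned jointly with the backbone is not specified here and would depend on the training setup.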
Copyright (c) 2025 Hemant Kumar, Rishabh Sachan, Mamta Tiwari, Amit Kumar Katiyar, Namita Awasthi, Puspha Mamoria

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.