ParsAirCall: Automated Conversational IVR in Airport Call Center using Deep Transfer Learning
Subject Areas : Multimedia Processing, Communications Systems, Intelligent Systems
Soheil Tehranipour 1, Mohammad Manthouri 2 *, Samaneh Yazdani 3
1 - MS Student, Department of Computer Engineering, North Tehran Branch, Islamic Azad University, Tehran, Iran
2 - Assistant Professor, Department of Electrical and Electronic Engineering, Shahed University, Tehran, Iran
3 - Assistant Professor, Department of Computer Engineering, North Tehran Branch, Islamic Azad University, Tehran, Iran
Keywords: Call center, Automatic speech recognition, Deep transfer learning, Airport smart systems
Abstract :
Introduction: In this paper, we introduce ParsAirCall, a toolkit for automatic recognition of spoken Persian numbers in airport telephone systems. It uses deep transfer learning to improve performance in real operational scenarios of voice-controlled smart telephone systems at airports across the country. Despite recent advances in artificial intelligence, traditional systems for interacting with callers remain inefficient; automating repetitive tasks can improve their efficiency.
Method: ParsAirCall outperforms competing Persian-language models, achieving higher accuracy with fewer parameters and lower computing requirements. To address the scarcity of data for Persian speech recognition, we collected a 30-hour telephony dataset, which was used to train the final ParsAirCall model. The model builds on the QuartzNet architecture, and our deep transfer learning strategy allows it to capture the features of Persian speech needed for number recognition in airport telephone calls.
Results: Experiments were conducted on both our collected telephony dataset and the Common Voice project. ParsAirCall achieved a word error rate (WER) of 2.7% in number recognition in airport telephone calls.
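To make the reported metric concrete: WER is conventionally defined as (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the recognized transcript into the reference, and N is the number of reference words. The following is a minimal sketch of that standard definition, not the scoring procedure used in the paper (which the abstract does not specify). The function name `word_error_rate` and the English digit strings are hypothetical and serve only as illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name); implements the textbook
    definition WER = (S + D + I) / N with whitespace tokenization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of five reference words.
example_wer = word_error_rate("flight seven two four one",
                              "flight seven two for one")
```

Under this definition, a WER of 2.7% corresponds to roughly 27 word-level errors per 1,000 reference words. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.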
Discussion: ParsAirCall can be integrated as a service into any Persian-language airport telephone system, making it a practical tool for number recognition in airport call centers and telephone systems. It illustrates how advanced technologies can streamline communication processes within critical operational environments.