DEVELOPMENT OF A TEXT-TO-SPEECH SYNTHESIS SYSTEM FOR THE YORUBA LANGUAGE USING DEEP LEARNING
Abstract
Speech synthesis, or Text-to-Speech (TTS) synthesis, is a field of computer science that enables computers to convert written text into natural-sounding speech. The technology offers powerful capabilities that developers can use to enhance user experiences and human-computer interaction. This research applies deep learning techniques, including variational inference and adversarial training, to improve the expressiveness and generative modelling of TTS systems. The study employs the BibleTTS corpus, consisting of audio recordings and aligned text transcripts, to develop a TTS method that accommodates diverse rhythms and pitches and thus captures the natural one-to-many relationship between text input and speech output. The focal point of the study is Yoruba, a tonal language with distinct linguistic features. By aligning Yoruba text with its audio through forced alignment, the proposed TTS system aims to produce a human-like voice with accurate tone and syllable sequencing. The results show that the deep learning approach used in this study outperforms publicly available TTS systems and achieves a high Mean Opinion Score (MOS) comparable to ground truth. The proposed TTS synthesis system has promising applications, including talking Global Positioning System (GPS) navigation, text reading in word-processing applications, and web-based text-to-audio conversion. As TTS synthesis gains momentum across languages and industries, this study contributes by advancing TTS technology for the Yoruba language, fostering accessibility and human-computer interaction in the digital era.
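To make the modelling approach concrete, the sketch below shows how a conditional-VAE TTS objective of the kind the abstract describes can combine a mel-spectrogram reconstruction term, a KL-divergence term between the audio-derived posterior and the text-conditioned prior (the variational-inference component), and a least-squares adversarial term (the adversarial-training component). This is a minimal illustration under stated assumptions: the loss weights, tensor shapes, and function names are hypothetical, not the system's actual implementation.

```python
# Minimal sketch of a conditional-VAE + adversarial TTS objective.
# All names, shapes, and weights below are illustrative assumptions.
import torch
import torch.nn.functional as F

def kl_divergence(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over latent dims.

    q is the posterior inferred from audio; p is the prior predicted
    from the (Yoruba) text, so minimizing this ties speech latents
    to the input text while leaving room for prosodic variation.
    """
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q
                + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1).mean()

def generator_loss(mel_real, mel_fake, disc_fake,
                   mu_q, logvar_q, mu_p, logvar_p,
                   lambda_mel=45.0, lambda_kl=1.0):
    """Generator objective: reconstruction + KL + adversarial terms.

    The least-squares adversarial term pushes D(fake) toward 1;
    lambda_mel=45.0 is a commonly used weight in GAN vocoders,
    assumed here rather than taken from the paper.
    """
    loss_mel = F.l1_loss(mel_fake, mel_real)
    loss_kl = kl_divergence(mu_q, logvar_q, mu_p, logvar_p)
    loss_adv = torch.mean((disc_fake - 1.0) ** 2)
    return lambda_mel * loss_mel + lambda_kl * loss_kl + loss_adv

def discriminator_loss(disc_real, disc_fake):
    """Least-squares GAN discriminator loss: D(real) -> 1, D(fake) -> 0."""
    return torch.mean((disc_real - 1.0) ** 2) + torch.mean(disc_fake ** 2)
```

In a full system, `mel_fake` would be decoded from a sample of the posterior, and the discriminator would typically be a multi-scale or multi-period design as in GAN-based neural vocoders; the sampling step is what lets one text input map to many valid realizations of rhythm and pitch.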