DEVELOPMENT OF A TEXT-TO-SPEECH SYNTHESIS FOR YORUBA LANGUAGE USING DEEP LEARNING
Ayogu Bosede
Gabriel Ogunleye
Ogunrekun Patricia
Abstract
This study presents a comparative analysis of machine learning techniques for URL-based phishing detection, with the primary objective of identifying the combinations of data representation and model that give the best detection performance. The research explores four data representations: the URL sequence, extracted lexical features, external domain features, and a hybrid combination of lexical and domain features. Two categories of machine learning models are evaluated: deep learning techniques, namely Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, and Transformers, which take tokenized URL sequences as input; and traditional techniques, namely Logistic Regression, Random Forest, Gradient Boosting Decision Tree (GBDT), and Fully Connected Networks (FCN), which take the lexical, domain, and hybrid representations as input. The models are evaluated on a comprehensive dataset of phishing URLs covering diverse attack vectors and complexities. Performance is assessed using accuracy, True Positive Rate (TPR), False Positive Rate (FPR), area under the Receiver Operating Characteristic curve (ROC AUC), and inference time. The results show that the Transformer, a deep learning architecture designed for sequential data processing, achieves an accuracy of 96.41% when applied to tokenized URL sequences. However, this model is comparatively slow at inference because of its deep architecture. In contrast, the GBDT model with hybrid features outperforms the other models, with an accuracy of 96.85% and faster inference. This highlights the potential of gradient boosting combined with a hybrid feature representation for efficient and accurate phishing detection.
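To make two of the terms above concrete, the following minimal Python sketch shows what a lexical feature vector for a URL might look like and how accuracy, TPR, and FPR are computed from binary predictions. The function names `lexical_features` and `detection_metrics`, and the particular features chosen, are illustrative assumptions; the abstract does not specify the feature set or the implementation used in the study.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Extract a few simple lexical features from a URL string.

    The feature list is a hypothetical example, not the study's feature set.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),                     # dots in the host name
        "num_digits": sum(ch.isdigit() for ch in url),   # digits anywhere in the URL
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
    }


def detection_metrics(y_true: list, y_pred: list) -> dict:
    """Accuracy, TPR, and FPR for binary labels (1 = phishing, 0 = legitimate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # TPR = TP / (TP + FN): share of phishing URLs that are caught
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        # FPR = FP / (FP + TN): share of legitimate URLs wrongly flagged
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }
```

In a hybrid representation, a dictionary like the one returned by `lexical_features` would be concatenated with external domain features before being passed to a model such as GBDT; that step is omitted here because the domain features depend on external lookups.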